Abstract 

Multimedia processing with emphasis on graphics is becoming increasingly important, with a wide 
variety of applications ranging from multimedia cell phones to high definition interactive 
television. Media processing involves the capture, storage, manipulation and transmission of 
multimedia objects such as text, handwritten data, audio objects, still images, 2D/3D graphics, 
animation and full-motion video. A number of implementation strategies have been proposed for 
processing multimedia data. These approaches can be broadly classified based on the evolution 
of processing architectures and the functionality of the processors. In order to provide media 
processing solutions to different consumer markets, designers have combined some of the 
classical features from both the functional and evolution based classifications resulting in many 
hybrid solutions. Multimedia and Graphics applications are computationally intensive and have 
been traditionally solved in 3 different ways. One is through the use of a high speed general 
purpose processor with accelerator support, which is essentially a sequential machine with 
enhanced instruction set architecture. Here the overlaying software bears the burden of 
interpreting the application in terms of the limited tasks that the processor can execute 
(instructions) and schedule these instructions to avoid resource and data dependencies. The 
second is through the use of an Application Specific Integrated Circuit (ASIC) which is a 
completely hardware oriented approach, spatially exploiting parallelism to the maximum extent 
possible. The former, although slower, offers the benefit of hardware reuse for executing other 
applications. The latter, albeit faster and more power, area & time efficient for a specific 
application, offers poor hardware reutilization for other applications. The third is through 
specialized programmable processors such as DSPs and media processors. These attempt to 
incorporate the programmability of general purpose processors and provide some amount of 
spatial parallelism in their hardware architectures. 

The complexity, variety of techniques and tools, and the high computation, storage and I/O 
bandwidths associated with multimedia processing present opportunities for reconfigurable 
processing to enable features such as scalability, maximal resource utilization and real-time 
implementation. The relatively new domain of reconfigurable solutions lies in the region of 
computing space that offers the advantages of these approaches while minimizing their 
drawbacks. Field Programmable Gate Arrays (FPGAs) were the first attempts in this direction, 
but poor on-chip network architectures led to high reconfiguration times and power 
consumption. Improvements over this design using hierarchical network architectures with 
RAM-style configuration loading have led to a 2-4 times reduction in individual 
configuration loading times. However, the number of redundant and repetitive configurations 
remains high. This is one of the important factors that lead to large overall configuration 
times and high power consumption compared to ASIC or embedded processor solutions. We 
believe that designing processing elements based on identifying correlated compute-intensive 
regions within each application and between applications will result in large amounts of 
processing in localized regions of the chip. This reduces the number of reconfigurations and 
hence enables faster application switching. It will also reduce the amount of on-chip communication, 
which in turn helps reduce power consumption. Since applications can be represented as Control 
Data Flow Graphs (CDFGs), such a pre-processing analysis lies in the area of pattern matching, 
specifically graph matching. In this context we propose a reduced-complexity yet sufficiently 
exhaustive graph matching algorithm. We further propose to reduce the amount of on-chip 
communication by adopting reconfiguration-aware static scheduling to manage task and resource 
dependencies on the processor. This is complemented by a divide-and-conquer approach which 
helps in the allocation of an appropriate number of processing units aimed towards achieving 
uniform resource utilization. 

To validate the success of this approach, we will obtain estimates of reconfiguration times and 
also show the potential for reduction in power consumption, by identifying 
correlated compute-intensive regions on an assorted set of algorithms taken from 
media standards such as MPEG-4 and from frequently used graphics algorithms. 

1 Introduction 

A variety of media processing techniques are typically used in multimedia processing 
environments to capture, store, manipulate and transmit multimedia objects such as text, 
handwritten data, audio objects, still images, 2D/3D graphics, animation and full-motion video. 
Example techniques include speech analysis and synthesis, character recognition, audio 
compression, graphics animation, 3D rendering, image enhancement and restoration, 
image/video analysis and editing, and video transmission. Multimedia computing presents 
challenges from the perspectives of both hardware and software. For example, multimedia 
standards such as MPEG-1, MPEG-2, MPEG-4, MPEG-7, H.263 and JPEG 2000 involve 
execution of complex media processing tasks in real-time. The need for real-time processing of 
complex algorithms is further accentuated by the increasing interest in 3-D image and 
stereoscopic video processing. Each media in a multimedia environment requires different 
processes, techniques, algorithms and hardware. The complexity, variety of techniques and tools, 
and the high computation, storage and I/O bandwidths associated with multimedia processing 
present opportunities for reconfigurable processing to enable features such as scalability, 
maximal resource utilization and real-time implementation. 

To demonstrate the potential for reconfiguration in multimedia computations, we have performed 
a detailed complexity analysis of the recent multimedia standard (MPEG-4) [80], which we 
believe involves multiple media and encompasses a wide range of operations typically found in 
media processing. The results of our analysis show that there are significant variations in the 
computational complexity among the various modes/operations of MPEG-4. This points to 
extensive opportunities for exploiting reconfigurable implementations of 
multimedia/graphics algorithms. 

The availability of large, fast FPGAs is making reconfigurable implementations possible for a 
variety of applications. FPGAs consist of arrays of Configurable Logic Blocks (CLBs) that 
implement various logical functions. The latest FPGAs from vendors like Xilinx and Altera can 
be partially configured and run at several megahertz. Ultimately, computing devices may be able 
to adapt the underlying hardware dynamically in response to changes in the input data or 
processing environment and process real-time applications. Thus FPGAs have established a point 
in the computing space which lies between the dominant extremes of computing: ASICs and 
software-programmable, instruction-set-based architectures. Three dominant features 
differentiate reconfigurable architectures from instruction-set-based programmable 
computing architectures and ASICs: (i) spatial implementation of instructions through a network 
of processing elements, without an explicit instruction fetch-decode model; (ii) flexible 
interconnects which support task-dependent data flow between operations; and (iii) the ability to 
change the arithmetic and logic functionality of the processing elements. The reprogrammable space is 
characterized by the allocation and structure of these resources. Computational tasks can be 
implemented on a reconfigurable device with intermediate data flowing from the generating 
function to the receiving function. The salient features of reconfigurable machines are: 

• Instructions are implemented through locally configured processing elements, thus allowing 
the reconfigurable device to effectively process more instructions into active silicon in each 
cycle. 

• Intermediate values are routed in parallel from producing functions to consuming functions 
(as space permits) rather than forcing all communication to take place through a central 
resource bottleneck. 

• Memory and interconnect resources are distributed and are deployed based on need rather 
than being centralized, hence presenting opportunities to extract parallelism at various levels. 

The networks connecting the Configurable Logic Blocks or Units (CLBs) or processing 
elements can range from full-connectivity crossbars to neighbor-only mesh networks. 
The best characterization to date, which empirically measures the growth in the interconnection 
requirements with respect to the number of Look-Up Tables (LUTs), is Rent's rule, which is 
given as follows: 

$$N_{IO} = C \, N_{gates}^{p}$$ 

where $N_{IO}$ corresponds to the number of interconnections (in/out lines) in a region containing 
$N_{gates}$ gates, and $C$ and $p$ are empirical constants. For logical functions, $p$ typically lies 
in the range $0.5 < p < 0.7$. 
It has been shown [1] (by building the FPGA based on Rent's model and using a hierarchical 
approach) that the configuration instruction sizes in traditional FPGAs are higher than necessary 
by at least a factor of 2-4. Therefore, for rapid configuration, off-chip context loading becomes 
slow due to the large amount of configuration data that must be transferred across an I/O path 
of limited bandwidth. It is also shown that greater word widths increase wiring requirements 
while decreasing switching requirements. In addition, larger-granularity data paths can be used to 
reduce instruction overheads. The utility of this optimization largely depends on the granularity 
of the data which needs to be processed. However, if the architectural granularity is larger than 
the task granularity, the device's computational power will be underutilized. Another promising 
development in efforts to reduce configuration time is shown in [2]. The authors propose the use 
of a random access technique to selectively load a configurable logic unit on the processor. This 
adds some overhead in address decoding circuitry, but the overhead is very low compared to the rest 
of the routing resources on the chip. It is shown to be efficient compared to the shift-register style 
of programming the processor. This approach still does not reduce the number of configurations 
that need to be performed for migrating from one portion of an application to another portion of 
the same application or of a different one. 
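
As a rough numerical illustration of Rent's rule, the sketch below evaluates the number of I/O lines for a few region sizes. The constant C = 4.0 and the region sizes are assumed values chosen only for illustration; the exponents span the range quoted above for logical functions.

```python
def rent_io(n_gates: int, c: float, p: float) -> float:
    """Rent's rule estimate of the I/O lines needed by a region of n_gates."""
    return c * n_gates ** p


# C = 4.0 and the region sizes are assumed, illustrative values.
for p in (0.5, 0.6, 0.7):
    estimates = [round(rent_io(n, 4.0, p), 1) for n in (64, 256, 1024)]
    print(p, estimates)
```

The point of the exercise is the growth rate: quadrupling the region size multiplies the I/O estimate by 4 raised to the power p, so a larger p means interconnect demand grows faster with region size.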

Most of the current approaches towards building a reconfigurable processor are targeted towards 
performance in terms of speed and are not tuned for power awareness or configuration time 
optimization. Therefore certain problems have surfaced that need to be addressed at the 
pre-processing phase. 

Firstly, the granularity or the processing ability of the Configurable Logic Units (CLUs) must be 
driven by the set of applications that are intended to be ported onto the processing platform. 
Some research groups have taken the approach of visual inspection [3], while others have 
adopted algorithms of exponential complexity [4,5] to identify regions in the application's DFGs 
that qualify for CLUs. None of the current approaches attempt to identify the regions through an 
automated low complexity approach that deals with CDFGs. 
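
To make the notion of recurring regions concrete, the following sketch counts producer/consumer operation pairs that two data flow graphs have in common. It is a deliberately simplified stand-in and not the graph matching algorithm proposed in this work: it looks only at single edges, ignores control nodes, and the two example graphs are hypothetical.

```python
from collections import Counter


def edge_patterns(ops, edges):
    """Count (producer_op, consumer_op) pairs over the edges of one DFG."""
    return Counter((ops[u], ops[v]) for u, v in edges)


def common_patterns(dfg_a, dfg_b):
    """Patterns present in both DFGs, keeping the smaller of the two counts."""
    return edge_patterns(*dfg_a) & edge_patterns(*dfg_b)


# Hypothetical DFGs: (node id -> operation, list of directed data edges).
fir = ({0: "mul", 1: "mul", 2: "add", 3: "add"}, [(0, 2), (1, 2), (2, 3)])
dct = ({0: "mul", 1: "add", 2: "sub", 3: "mul"}, [(0, 1), (3, 1), (1, 2)])

print(dict(common_patterns(fir, dct)))
```

A pattern that occurs often in several applications, such as the multiply feeding an add here, is a candidate for a dedicated CLU; larger patterns would need a real subgraph matching step.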

Secondly, the number of levels in a hierarchical network architecture must be influenced by the 
number of processing elements or CLUs needed to complete the task or application. This in turn 
depends on the amount of parallelism that can be extracted from the algorithm and the 
percentage of resource utilization. To the best of our knowledge, no research group in the area of 
reconfigurable computing has dealt with this problem. 

Thirdly, the complex network on the chip makes dynamic scheduling expensive, as it adds to the 
primary burden of power dissipation through routing resource utilization. Therefore there is a 
need for a reconfiguration aware scheduling strategy. Most research groups have adopted 
dynamic scheduling for a reconfigurable accelerator unit through a scheduler that resides on a 
host processor [6,7]. 
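
One way to picture reconfiguration-aware scheduling is a static list scheduler that, among the tasks whose dependencies are satisfied, prefers one that reuses the configuration already loaded. The sketch below is a minimal greedy version under assumed inputs (a single processing region, one configuration id per task); it is not the scheduling strategy developed in this work.

```python
def schedule(tasks, deps):
    """Greedy static schedule that tries to avoid configuration loads.

    tasks: task name -> configuration id it needs (hypothetical ids)
    deps:  task name -> set of tasks that must finish first
    Returns (execution order, number of configuration loads).
    """
    done, order, loaded, loads = set(), [], None, 0
    while len(order) < len(tasks):
        ready = sorted(
            t for t in tasks if t not in done and deps.get(t, set()) <= done
        )
        if not ready:
            raise ValueError("cyclic dependencies")
        # Prefer a ready task that reuses the configuration on the fabric.
        pick = next((t for t in ready if tasks[t] == loaded), ready[0])
        if tasks[pick] != loaded:
            loaded, loads = tasks[pick], loads + 1
        done.add(pick)
        order.append(pick)
    return order, loads


tasks = {"a": "X", "b": "Y", "c": "X", "d": "Y"}
deps = {"d": {"b"}}
print(schedule(tasks, deps))
```

For this example, running the tasks in alphabetical order would load a configuration four times, while the greedy order groups tasks by configuration and loads only twice. Because the order is fixed before execution, no run-time scheduler traffic is needed on the chip.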

The increasing demand for fast processing, high flexibility and reduced power consumption 
naturally calls for the design and development of a dynamically reconfigurable processor that is 
aware of, and keeps low, its configuration time. 

2 Related Work 

The literature review on related work is divided into two phases. In the first phase we 
discuss the approaches taken by other researchers in the design of reconfigurable processors. 
The design methodology for a reconfigurable processor can, however, be better analyzed and 
understood by breaking it down into smaller sub-problems which are linked together. In the 
second phase, therefore, relevant background work carried out by researchers in other fields 
of engineering (who share similar sub-problems) is provided alongside each of the smaller 
problems individually. 

2.1 Review of literature on Reconfigurable Computing 

Reconfigurable computing is a paradigm of computing which aims to achieve the performance of 
an ASIC with the flexibility that is associated with a General-purpose processor. ASICs offer the 
best individual hardware implementations, but lack flexibility to execute a variety of these 
algorithms. General purpose computing machines are highly flexible as any algorithm can be 
expressed in terms of the machine's language (instruction set). The concept of reconfigurable 
computing attempts to find a solution that has the advantages of both paradigms and has been 
well defined in [1,8]. A wide range of solutions, varying from the Xilinx FPGA to HCS research 
laboratories' Dynamically Reconfigurable Network Processor, have been proposed so far. Apart 
from the plethora of work being undertaken in academia, various companies have also come 
up with commercial solutions. This section presents a survey of both commercially available 
reconfigurable solutions and those proposed by researchers in academia. 
Reconfigurable devices can be categorized as follows according to their properties. 

1. Reconfiguration methodology 

> Run-time 

> Static 

> Hybrid 

2. Granularity 

> Look Up Table based Field Programmable Gate Array 

> Configuration Logic Block 

> Functional Unit 

3. Interconnect 

> Mesh 

> Hierarchical 

> Linear 

> Cross-bar 

4. Functionality 

> DSP (Digital Signal Processing) 

> Co-processor 

> Accelerator 

> Stand-alone 

5. Application 

> General-purpose 

> Class of applications 

> Special-purpose 

Criteria for classification of reconfigurable processors 

a. Reconfiguration methodology 

Reconfiguration methodology gives rise to three types of Reconfigurable Processors (RP): 
Statically reconfigurable, dynamically reconfigurable and hybrid. By hybrid, we refer to those 
processors which can be partially reconfigured at run-time. 

b. Granularity of a reconfigurable processor 

The granularity of an RP defines the types of applications that the RP can attempt to solve. This 
property, along with the interconnect structure, determines the power consumption of the RP to a 
great extent. A coarse-granular processor must be supported by an efficient routing architecture 
to keep the power consumption within bounds. Granularity can range from the fine-granular 
FPGAs to the coarse-granular reconfigurable SoCs. Three levels of granularity can be 
identified: Look-Up Table based, Configurable Logic Block based and Functional Unit based. 

c. Interconnect structure of the reconfigurable processor 

Interconnect structure is an important attribute of a reconfigurable processor. It has been shown 
that 70% of the energy consumed by a reconfigurable chip is due to interconnects. The interconnect 
architecture must serve two competing purposes: provide maximum connectivity between 
processing blocks and occupy minimal area of the circuitry. The number of switches along the 
various interconnect paths must be minimized, because this determines the time taken to 
reconfigure a path or circuit. Four standard interconnect architectures have been widely used: 
mesh, hierarchical, linear and crossbar. Some RPs use a combination of two or 
more of these interconnect structures. 
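
The trade-off between connectivity and path length can be sketched with idealized worst-case hop counts for the four interconnect styles. The formulas below are textbook simplifications and assumptions of this sketch (a square mesh, a balanced tree with an assumed fanout of 4 for the hierarchical case, a single crosspoint for the crossbar); they are not measurements of any of the processors surveyed here.

```python
import math


def worst_case_hops(kind: str, n: int, fanout: int = 4) -> int:
    """Idealized worst-case switch hops between two of n processing elements."""
    if kind == "linear":
        return n - 1  # end to end along the array
    if kind == "mesh":
        side = math.isqrt(n)
        if side * side != n:
            raise ValueError("mesh sketch assumes a square array")
        return 2 * (side - 1)  # corner to opposite corner
    if kind == "hierarchical":
        levels, span = 0, 1
        while span < n:  # tree levels needed to cover n elements
            span *= fanout
            levels += 1
        return 2 * levels  # up to the root and back down
    if kind == "crossbar":
        return 1  # one crosspoint, at the cost of n * n switches
    raise ValueError(f"unknown interconnect: {kind}")


for kind in ("linear", "mesh", "hierarchical", "crossbar"):
    print(kind, worst_case_hops(kind, 64))
```

The crossbar minimizes hops but its switch count grows with the square of the number of elements, which is why it conflicts with the minimal-area goal stated above.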

d. Functionality of a reconfigurable processor 

This property determines how the RP is implemented in the overall architecture and defines the 
purpose of the RP. Parameters like the amount of resources available to the RP and the level of 
interactivity between these resources can be determined. The four types of RP based on these 
criteria are DSP, co-processor, accelerator and stand-alone. 

e. Target application domain of the reconfigurable processor 

There are three types of target applications that existing RPs have been designed for. Some 
processors have been designed to support general-purpose computing, with reconfiguration being 
used to accelerate the execution process. A second type of processor targets the special-purpose 
domain, wherein reconfiguration is used to gain an enhancement in performance. 
A third type is targeted towards a class of applications wherein the applications 
have a large number of kernels in common. Reconfiguration can be used to switch between 
applications in this class. 



Table 1: Classification of Reconfigurable Processors (RP) 

Processor              Reconfiguration  Granularity  Interconnect  Functionality  Application
…                      …                …            …             Accelerator    Class of applications
Low energy FPGA        Static           FPGA         Mesh          Accelerator    General-purpose
PADDI                  Static           FU           Crossbar      Stand-alone    Special-purpose
Pleiades architecture  Static           FU           Hierarchical  Accelerator    Class of applications
XiRisc                 Static           FU           Linear        Co-processor   General-purpose
Cognigine-VISC         Static           FU           Other         DSP            General-purpose

Table 1 shows the classification of various reconfigurable processors based on the 
aforementioned criteria. Following is a brief survey of RP based on the reconfiguration 
methodology used. 
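
The five classification criteria map naturally onto a small record type. The sketch below encodes the complete rows of Table 1 and filters them by criterion; the class name, field names and the `select` helper are assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RP:
    """One reconfigurable processor, described by the five criteria."""
    name: str
    reconfiguration: str
    granularity: str
    interconnect: str
    functionality: str
    application: str


# Complete rows of Table 1.
TABLE_1 = [
    RP("Low energy FPGA", "Static", "FPGA", "Mesh", "Accelerator", "General-purpose"),
    RP("PADDI", "Static", "FU", "Crossbar", "Stand-alone", "Special-purpose"),
    RP("Pleiades architecture", "Static", "FU", "Hierarchical", "Accelerator",
       "Class of applications"),
    RP("XiRisc", "Static", "FU", "Linear", "Co-processor", "General-purpose"),
    RP("Cognigine-VISC", "Static", "FU", "Other", "DSP", "General-purpose"),
]


def select(rows, **criteria):
    """Names of the processors whose fields match every given criterion."""
    return [
        r.name for r in rows
        if all(getattr(r, key) == value for key, value in criteria.items())
    ]


print(select(TABLE_1, granularity="FU", application="General-purpose"))
```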

Static and Dynamically reconfigurable processors 

PipeRench [9] is a pipelined reconfiguration fabric. It has multiple pipes, and each pipe consists of 
a linear array of reconfigurable CLBs. The interconnect structure is linear, enabling data transfer 
from left to right. The amount of computation that can be done by each CLB is small. Hence 
realization of kernels with large execution time is not possible. Performance is achieved via 
pipelining. Run-time reconfiguration is used to control data flow between CLBs. Though the initial 
version of PipeRench did not support memory access, this has been corrected in PipeRench+. 

Field Programmable System On a Chip (FIPSOC) [10] consists of a microcontroller, a 
programmable digital unit, a configurable analog unit and internal memory. The digital unit is 
known as the Digital Macro Cell (DMC) and contains a 2-D array of 4-bit programmable elements. 
The 4-bit programmable elements contain a sequential block, combinational block and internal 
routing resources. These elements are programmed with the help of Look-Up Tables (LUTs). 
The microcontroller has complete access to these LUTs and can write onto them in one clock. 
This facilitates both static as well as dynamic reconfiguration. Routing is done via 16 horizontal 
channels per row and 24 vertical channels per column. Switching matrices are used to 
interconnect horizontal and vertical routes. Special nets are used for clock transfer. Clock 
distribution network is used for power reduction. 

Chameleon Reconfigurable Communications Processor [11] consists of a 32-bit embedded core, 
a 32-bit reconfigurable fabric and a high-speed system bus. The reconfigurable fabric consists of 
slices, each of which contains three tiles. Each of these slices can be reconfigured separately. Each 
tile is made up of 32-bit reconfigurable datapath units (DPUs), a local storage memory, a 16x24 
multiplier and a control unit. There is complete connectivity between DPUs. Each DPU is connected to 
its neighbors and other DPUs in the same slice with a delay of one clock and is connected to 
DPUs in other slices with a delay of 2 clocks. The routability is dynamic. Since the configuration 
information is built into the instructions, reconfiguration time is minimal. 

Multiple Alu archiTecture with Reconfigurable Interconnect experiment (MATRIX) [12] 
consists of an array of 8-bit Functional Units (FU) interconnected through a configurable 
network. Each FU consists of a 256x8 memory, control logic and an 8-bit ALU. A three-level 
hierarchical interconnect is used to connect the FUs. The three levels are nearest neighbors, 
length four bypass and global buses. FU port inputs and network lines can be configured either 
statically or dynamically. 

Configurable System on Chip (CSoC) [13] consists of a 4x4 XPP core from PACT, an ARM 
processor, memory elements and interconnects. The Xtreme Processing Platform (XPP) core is a 
hierarchical array of coarse-grain Processing Array Elements (PAEs). A number of PAEs are 
clustered to form a Processing Array Cluster (PAC), each of which is associated with a 
configuration manager. This manager has distributed access over the configuration memory of 
the PAEs. Hence, configuration and processing can be done in parallel. Dataflow is stream-based. 

Reconfigurable Pipelined Datapath (RaPiD) [14] is a pipelined reconfigurable unit with multiple 
pipelines. A single pipeline is designed as a linear array of reconfigurable units. Both static and 
dynamic control are available. Static control is used to reconfigure the Functional Units as in any 
typical FPGA. Dynamic control is used to determine dataflow during run-time. Each Functional 
Unit can perform multiplication and ALU operations on one word of data. External memory 
accesses are not handled by the reconfigurable array. 

Run-time reconfigurable processors 

The Garp [15] architecture consists of a single-issue microprocessor interconnected with a 
reconfigurable array of Configurable Logic Blocks (CLBs). The reconfigurable array is similar to 
an FPGA, the only difference being in the granularity of the processing elements. The processing 
elements are connected using a crossbar interconnect. A wide data-path is provided between the 
memory and the array, thereby increasing the bandwidth. Moreover, preloading of configuration 
bitstreams into the configuration memory of the reconfigurable array is possible. The optimizations 
for a particular application are mainly obtained using a well-defined compiler. MorphoSys [16] 
consists of the following parts: an 8x8 Reconfigurable Cell (RC) array with context memory, a 
TinyRISC processor used as controller, a frame buffer and a DMA controller. The RC array is an 8x8 
2-D array interconnected using a 2D-mesh architecture along with full row/column 
interconnectivity. Each RC has an ALU, a multiplier and a register file. The context bitstreams are 
transferred in two modes: row broadcast and column broadcast. Sea of Processors (SOP) [17] 
consists of a large number of logic elements. Each logic element is connected to 8 neighboring 
elements. Connectivity is achieved using switching blocks. Switching blocks are of two types: 
point-to-point and common bus. Each logic element consists of Configurable Arithmetic Blocks 
(CABs), each of which consists of a full adder and flip-flops. Memory elements are designed as 
functional memory. 

Chimaera [18] is a RISC architecture with a Reconfigurable Functional Unit (RFU). The RFU is 
a coarse-grain FPGA. Plado [19] consists of a simple Motorola core processor along with a Xilinx 
FPGA as a reconfigurable co-processor. It was designed as a proof of concept, with more emphasis 
on compiler techniques than on hardware realization. PRISM [20] is a RAM-based FPGA 
system, also designed as a proof of concept. There is a strong link between the compiler and 
the underlying hardware. Common Minimal Processor Architecture with Reconfigurable 
Extension (CoMPARE) [21] is a typical RISC architecture. It has a 3-stage pipeline consisting 
of an Instruction Fetch/Decode unit, a Reconfigurable Functional Unit and a Load/Store unit, and a 
register file. The RPU has the common ALU operations along with a number of Configurable 
Arithmetic Units (CAUs). ILP can be exploited, as a CAU can execute in parallel with the ALU. The 
CAU is made up of 4-bit and 8-bit LUTs. Concise [22] has a RISC architecture similar to 
CoMPARE. The Chess array [23] is an FPGA-type reconfigurable platform with 4-bit ALUs 
interconnected using 4-bit data buses. A nearest-neighbor switching network is used to connect 
the various ALUs. The switching boxes can also be used to store data when not in use. 
Additional connectivity is provided using extra-long wires. The local connectivity is made 
denser than the global connectivity. Spyder [24] is a reconfigurable superscalar co-processor using 
FPGAs. Spyder has a fixed underlying VLIW structure with a single control unit and multiple 
reconfigurable execution units (REUs). For the REUs, the I/O ports are fixed, but the functionality 
can be defined by the user. 

Reconfigurable Architecture Workstation (RAW) [25] processor is a chip containing a number of 
identical tile processors. Each tile processor consists of local memory, static switch and a 
dynamic router. The static network is a mesh architecture, while the dynamic routing network is 
a partial crossbar architecture. Static switches are used for simple operations like move and load, 
whereas the dynamic router is used for data transfer. Each tile contains 32 KB each of data and 
instruction memory. Data transfer is done via static switches which are pipelined. KressArray 
[26] consists of several reconfigurable Data Path Units (rDPUs). Each of the data paths supports 
multiple data-widths. Communication is done via nearest-neighbor logic, which is 
programmable at run-time. KressArray thus resembles a systolic array, the only difference being 
that the interconnects between processing elements are programmable. Colt Configurable 
Computing Machine (CCM) [27] consists of 16 Functional Units interconnected using a 4x4 
mesh architecture. It also contains a 16x16 multiplier and an additional programmable bus to 
connect units that need high bandwidth. This architecture supports both bit-level and word-level 
data transfers, which makes it different from conventional FPGAs. Colt CCM is reconfigured 
using a technique known as Worm-hole run-time reconfiguration. Here multiple streams, each 
consisting of the configuration as well as computation-data information, are fed into the pool of 
Functional Units. The header information associated with each stream steers that particular 
stream through a set of FUs that need to be reconfigured by that stream. This architecture is best 
suited for high-performance computations. 

Reconfigurable Multimedia Array Coprocessor (REMARC) [28] consists of 32-bit processors 
known as nano-processors. Each nano-processor contains a program memory, which holds the 
list of instructions to be executed on that particular processor. The nano-processor does not, 
however, control its own sequencing: every cycle, a PC value is transmitted to each nano-processor, 
which determines the index of the instruction to be executed. The processors are connected with the 
help of horizontal and vertical bus architectures. Dynamically Reconfigurable hardware 
Architecture for Mobile systems (DReAM) [29] is a datapath-oriented architecture and consists 
of numerous Reconfigurable Processing Units (RPUs), each controlled by an RPU controller. The 
interconnect architecture is hierarchical in nature. The control design is also hierarchical. In 
addition to the nearest-neighbor connectivity, there are two global buses. Each RPU consists of two 
Reconfigurable Arithmetic Processing (RAP) units, a spreading datapath, a communications 
controller and two dual-port RAMs. Each RAP is built around an 8-bit multiplier and is capable of 
performing ALU and FSM-type operations. This architecture is best suited for wireless 
applications. 
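
The program-counter broadcast described above for REMARC can be pictured with a toy model: every nano-processor holds its own instruction list, and a single PC value per cycle selects which entry each one executes. The function name, the instruction mnemonics and the two-processor setup are hypothetical and serve only to illustrate the control scheme.

```python
def run(programs, pc_sequence):
    """Return, per cycle, the instruction each nano-processor executes.

    programs:    one instruction list per nano-processor (its program memory)
    pc_sequence: the PC value broadcast to all nano-processors each cycle
    """
    trace = []
    for pc in pc_sequence:
        # Every processor indexes its own memory with the same broadcast PC.
        trace.append([program[pc] for program in programs])
    return trace


programs = [
    ["add", "mul", "nop"],  # program memory of nano-processor 0
    ["sub", "nop", "shl"],  # program memory of nano-processor 1
]
for cycle, issued in enumerate(run(programs, [0, 2, 1])):
    print(cycle, issued)
```

The processors can perform different operations in the same cycle, yet sequencing stays centralized, which keeps per-processor control logic small.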

Static reconfigurable processors 

Configurable Algorithm-adaptive Instruction Set Topology (CALISTO) [30] incorporates an 
adaptive instruction-set architecture into a communications processor. Rabaey et al. [31] have 
developed LP-PGA, an improvement over existing FPGAs. They observed that 65% of the 
power consumption was due to interconnects. A hybrid interconnect structure consisting of 
Nearest Neighbor, Mesh and Hierarchical interconnects has been proposed. Configurable Logic 
Blocks (CLBs) have been realized using 3-input Look-up Tables (LUTs). A combination of 4 
such LUTs has been found to be optimal. Programmable Arithmetic Devices for high-speed DSP 
(PADDI) [32] is a typical coarse-grain reconfigurable architecture. It contains several 
programmable Execution Units (EXUs) interconnected through a configurable crossbar network. 
Each EXU contains a 16-bit adder and shifter logic along with registers. This architecture is 
meant for high-speed realization of data paths. 

Pleiades [33] is a heterogeneous reconfigurable DSP platform. It consists of a microprocessor 
and a group of reconfigurable modules also called satellites. The satellites are functional units 
which are capable of performing operations like MAC, SAD etc. Nearest Neighbors, Mesh and 
Hierarchical interconnects are all considered. A cost function is defined to determine which 
interconnect structure is best-suited for a given application. The computationally intensive 
kernels are mapped onto the satellites until all the satellites have been assigned. Since the entire 
application needs to be mapped onto the architecture, it is not always possible to meet timing 
constraints. In other words, Pleiades does not provide for hardware virtualization. 

Dedicated implementations in the reconfigurable arena for the graphics class of applications, 
namely Cohen-Sutherland line clipping, midpoint ellipse scan conversion, the BitBlt (bit block 
transfer) algorithm and Phong shading, have also been carried out ([36], [37] and [38]). 

XiRISC [34] is a load/store VLIW processor with a reconfigurable pipeline. This pipeline is 
realized using Pipelined Configurable Gate Arrays (PiCoGA). This array is tightly coupled with 
the main processor. There is dedicated control logic to direct control flow in a single pipeline. 
Programmable interconnects exist between different pipelines as well. XiRISC is a 32-bit 
architecture. The Cognigine Variable Instruction Set Communication (VISC) [35] architecture is 
a compile-time configuration based processor. It consists of 16 Reconfigurable Communications 
Units (RCUs). Each RCU has 4 execution units capable of performing 64-bit computations. 
Of these 4, two are meant for ALU operations and two for bit-level manipulations. It has a 
separate instruction cache to hold instructions. Any given application can be compiled and the 
corresponding Variable Instruction Set can be obtained. Hence, the emphasis is on the Cognigine 
C compiler and not on the underlying architecture. 

There are certain drawbacks to the approaches taken by contemporary efforts: 
The application analyzers are either manual in nature or do not have an efficient automated 
approach to search for patterns between the Control Data Flow Graphs (CDFGs) of algorithms 
that span abstract control data flow structures such as nested loops, hammock structures etc., 
which are intended to be executed on the processor by reconfiguring a previously configured 
algorithm. The fine granularity based approaches taken by contemporary research efforts have 
not considered the problem of designing the hardware based on the amount of reconfigurability 
and scheduling CDFGs such that the amount of on-chip communication for resource and task 
allocation is minimized. In most cases a fixed hardware is used and reconfigurability is explored 
based on that hardware. In some cases, researchers have used brute force search methods for 
exploring the granularity of the processing modules. When CDFGs involving extensive control 
nodes need to be mapped onto a limited number of resources (processing elements) on the chip, 
configuration aware scheduling becomes critical. 

Current approaches towards building a reconfigurable processor are targeted towards general 
purpose computing, mapping applications onto off-the-shelf FPGAs, or developing specialized 
architectures for a class of applications with an intuitive approach towards the design. The 
increasing demand for power and configuration time aware processing with stringent constraints 
for flexibility necessitates the design and development of a fast, dynamically reconfigurable 
processor with awareness towards lowering power consumption. The methodology for the design 
of such a reconfigurable processor is now presented. 

3 The Research Problem 

The research problem of designing a framework/methodology for the development of fast, 
dynamically reconfigurable processors consists of 3 primary tasks: 

1. To partition the CDFGs of applications into 'reconfigurable' and 'non-reconfigurable' 
regions. The reconfigurable regions in the CDFGs must then be executed by customized 
and individually designed monolithic entities (CLUs) on the processor. This partitioning 
is an important step because larger amounts of commonality in the tasks of the CDFGs 
represent a lower amount of reconfiguration and hence faster task swapping on the same 
units in silicon. This also indirectly reduces the power consumption by lowering the 
amount of on-chip inter-CLU communication. 

2. To determine the number of 'instantiations' of these entities so as to extract parallelism 
(at the instruction, data and group-of-instructions levels) and optimize resource utilization. 

3. Scheduling the CDFGs onto CLUs, with awareness for reconfiguration time. 
In the following section we provide a conceptual guideline to design this framework. 

3.1 Conceptual guideline 

The guideline consists of choosing an assortment of computationally intensive algorithms, 
followed by discrete and distinct steps that lead to the design of the framework. 

1. Choosing the algorithms that will constitute a benchmark suite. Target algorithms to be 
executed on the reconfigurable processor are from the set of media processing and 
graphics algorithms. These include the algorithms of advanced profiles of MPEG-4 
(main, core etc.), and graphics related algorithms such as line Clipping, Shading, Ellipse 
scan conversion and block transfer algorithms. 

These operations have complex geometric computations and large control flow 
structures. These are typically implemented on media accelerator cards or high end 
computing machines. There is scope for large amounts of parallelism as well as sufficient 
correlation amongst the applications. Literature survey shows that there have been 
implementations of some of these algorithms on FPGAs. The high levels of parallelism 
they offer can be exploited in FPGA forms of programmable architectures. 
These algorithms share the property of having inherent parallelism at both data and task 
levels. Programmable arrays of spatial processing elements have an inherent ability to 
exploit the data and task parallelism in applications. But the complexity of the processing 
elements and their interconnectivity varies from algorithm to algorithm. Each of these 
algorithms has found implementations in FPGAs, specialized ASICs and general purpose 
computing machines. 

We propose a tool for this purpose. Our tool will include portions of CDFGs (Control 
Data Flow Graphs) that contain control / branch instructions (points or nodes). We 
believe that applications are best analyzed in the form of CDFGs. These belong to the 
'graph' class of data structures. Finding isomorphism or near isomorphism for graphs or 
sub-graphs will produce the set of Largest Common Sub-graphs, which can be implemented 
as ASICs in a fully customized fashion. This will remove the need to have LUTs on a 
large scale on the reconfigurable platform, thereby deviating from the architecture 
model of programmable "Gate Arrays". 

In order to exploit the concept of reconfigurability plus parallel spatial execution, and yet 
maintain a reduced number of reconfigurations, the correlation in the nature of the applications 
that are to be ported onto the processor needs to be exploited. This methodology 
helps by eliminating switches (pass transistors) that are always or mostly closed (on) for 
the same connection on a given route (within the same hierarchy level or across hierarchy 
levels). Therefore we intend to show that by using the proposed methodology of 
determining the correlation in processing and communication needs of target algorithms 
we can optimize the Spatially Programmable Array of Elements (SPAE) class of 
architectures for a range of applications. 

When CDFGs involving extensive control node structures need to be mapped onto a 
limited number of reconfigurable resources (processing elements) on the chip, task and 
resource scheduling with awareness of configuration times and control structure behavior 
becomes critical. To support scheduling that achieves near optimal solutions without the 
use of dedicated on-chip dynamic schedulers, a strategy of static scheduling has been 
chosen. We believe that since a generic set of media oriented algorithms can be ported 
onto the reconfigurable processor, an extensive set of scheduling situations arising from 
CDFGs must be theoretically analyzed and solutions be prepared for them. The set of 
solutions shall be in the form of a collection of algorithms that execute on a Network 
Schedule Manager (NSM) that resides on the chip. These are intended to modify the 
schedule table that is stored on the NSM, at run time as and when the need arises. 

2. Identify common clusters in these algorithms using a set of steps that form CDFGs from 
optimized gcc-compiled assembly files, extract possible parallelism in regions of the 
application that offer such possibilities, and group portions of the CDFG into reconfigurable 
regions called zones or clusters through a graph matching algorithm based on canonical label 
comparisons. 

Note: The CDFGs now consist of 2 types of node groups: 

a) Clusterized (These nodes are henceforth referred to as Coarse Grain Processing 
Elements - CGPE) 

b) Ungrouped or non-clusterized (These nodes are henceforth referred to as Fine 
Grain Processing Elements - FGPE) 

3. All CGPEs will be represented as "Behavioral" VHDL modules. The FGPEs will be 
represented through LUTs (a group of {2-3} 4-input LUTs). 

Note: All modules (CGPEs & FGPEs) belonging to an application will be connected through 
a Rent's Hierarchical Network Architecture [1]. LUTs will be modeled using a combination 
of "Behavioral" & "Structural" style VHDL. 

4. If there are multiple occurrences of a Processing Element within an Application (for example 
within application (i) Gauss Jordan Elimination), 

• Allocate an appropriate number of 'Instantiations' based on a partitioning scheme applied 
on the CDFG. 

Note: If (# of Instantiations < # of Occurrences) 

-> Multiple successive configurations (Reconfigurations) of the network for an 
application are necessary. 

5. Static schedules for each application are obtained. There are 2 behaviors of the CDFG which 
we will permit to influence the static schedule: 

a. Conditional expression evaluations: We take care of this, by obtaining schedules based 
on Improved PCP and Branch-&-Bound techniques for each conditional expression. The 
schedules are merged towards the worst case. 

b. Iterative or loop behavior. In the 2nd pre-processing step we have loop unrolled all 
iterative segments with the dual properties of: absence of any conditional branch 
instructions inside the loop, and a known iteration count at compile time. Therefore 
only loops that do not satisfy these 2 properties remain to be considered. For all such 
cases, we take the approach of assuming the iteration count obtained whenever possible 
from trace statistics. In cases where trace statistics cannot be obtained a single iteration is 
considered with the time for that iteration being determined by the critical path in the 
loop. 
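
The two schedule-shaping behaviors in step 5 can be summarized in a small sketch. This is only an illustration under stated assumptions: the function names, the tuple return value, and the reading of "merged towards the worst case" as taking the longest branch schedule are ours and are not defined in the proposal.

```python
from typing import Optional, Sequence, Tuple


def loop_schedule_policy(has_conditional_branch: bool,
                         compile_time_count: Optional[int],
                         trace_count: Optional[int]) -> Tuple[str, int]:
    """Return (policy, iterations) assumed for a loop in the static schedule."""
    # Loops without internal branches and with a known trip count were
    # already unrolled in the pre-processing step.
    if not has_conditional_branch and compile_time_count is not None:
        return ("unrolled", compile_time_count)
    # Otherwise use an iteration count taken from trace statistics, if any.
    if trace_count is not None:
        return ("trace", trace_count)
    # No statistics: schedule one iteration, timed by the loop's critical path.
    return ("single_iteration", 1)


def merge_worst_case(branch_schedule_lengths: Sequence[int]) -> int:
    """Assumed merge rule: keep the longest of the per-branch schedules."""
    return max(branch_schedule_lengths)
```

For example, a loop with an internal branch and a trace count of 20 is scheduled for 20 iterations, while a loop with neither property nor trace data is scheduled for a single iteration.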

6. The Network architecture consists of switch boxes and interconnection wires. The 
architecture will be based on [1]. This will be modeled as a combination of "Behavioral" & 
"Structural" style VHDL. Modifications that will be made are: 

a. The Processing Elements derived in step 3 will be used instead of the 4 input LUTs that 
were used in Andre's model. 

b. RAM style address access will be used to select a module or a switch box on the circuit. 

c. Switch connections that are determined to be fixed for an application will be configured 
only once (at the start of that application). 

d. Switch connections that are determined to be fixed for all applications will be shorted and 
the RC model for power consumption for that particular connection will be ignored for 
power consumption calculations. 

e. The # of hierarchy levels will be determined by the application that has the maximum # 
of modules, because there is a fixed number of modules that can be connected 

7. There will be 1 Network Schedule Manager (NSM), modeled in "Behavioral" & "Structural" 
style VHDL, which will store the static schedule table for the currently running application. The 
NSM collects the evaluated Boolean values of all conditional variables from every module. 

8. For placing modules on the network 2 simple criteria are used. These are based on the 
assumption that the network consists of Groups of 4 Processing Unit Slots (G4PUS) 
connected in a hierarchical manner. 

Note: A loop could include 0 or more number of CGPEs. 

Therefore the following priority will be used for mapping modules onto the G4PUS: 

a. A collection of 1 to 4 modules which are encompassed inside a loop shall be mapped to a 
G4PUS. 

i. If there are more than 4 modules inside a loop, then the next batch of 4 modules are 
mapped to the next (neighboring) G4PUS. 

ii. If # of CGPEs in a loop ≥ 2, then they will have greater priority over any FGPEs in 
that loop for a slot in the G4PUS. 

b. For all other modules: 

i. CGPE Modules with more than 1 Fan-in from other CGPEs will be mapped into a 
G4PUS. 

ii. CGPE Modules with more than 1 Fan-in from other FGPEs will be mapped into a 
G4PUS. 

Note: The priorities are based on the amount of communication 
between modules. Both Fan-ins and Fan-outs can be considered; for simplicity, we 
choose Fan-ins to CGPEs only. 
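
As an illustration of rule (a), the sketch below packs the modules of one loop into G4PUS groups of four slots. It is a minimal sketch under assumptions: the function name and the (name, kind) tuples are hypothetical, and rule a.ii is read as "when a loop holds two or more CGPEs, the CGPEs claim slots before the FGPEs".

```python
def map_loop_to_g4pus(modules):
    """Pack the modules of one loop into groups of at most 4 slots.

    modules: list of (name, kind) pairs, kind being "CGPE" or "FGPE".
    Returns a list of groups; each group is a list of module names.
    """
    cgpes = [m for m in modules if m[1] == "CGPE"]
    fgpes = [m for m in modules if m[1] == "FGPE"]
    # Assumed reading of rule a.ii: CGPEs go first when there are >= 2 of them.
    ordered = cgpes + fgpes if len(cgpes) >= 2 else list(modules)
    names = [name for name, _ in ordered]
    # Rule a.i: every further batch of 4 goes to the next (neighboring) G4PUS.
    return [names[i:i + 4] for i in range(0, len(names), 4)]
```

A loop holding two CGPEs and three FGPEs would therefore fill one G4PUS with both CGPEs plus two FGPEs, and place the remaining FGPE in the neighboring G4PUS.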

9. Time estimation. 

Time to execute an application for a given area (area estimate models of XILINX FPGAs and 
[1] architectures can be used for only the routing portion of the circuit) and a given clock 
frequency can be measured in VHDL. 

The time taken to swap applications (reconfigure the circuit from implementing one 
application to another) is dependent on the similarity between the successor and predecessor 
circuits. We will measure the time to make a swap in terms of the # of bits required for loading a 
new configuration. RAM style loading of configuration bits will be used, which has been shown 
[2] to be faster than serial loading (used in Xilinx FPGAs). We expect a further speed up over the 
RAM style due to 2 reasons: 

a) The address decoder can only access one switch box at a time. So the greater the 
granularity of the modules, the fewer the number of switches used and hence configured. 

b) Compared to peer architectures which have only LUTs or a mixture of LUTs and CGPEs 
with low granularity (MAC units), we expect to have CGPEs of moderate granularity for 
abstract control-data flow structures in addition to FGPEs. Since these CGPEs are 
derived from the target applications, we expect their granularity to be the best possible 
choice for a reconfigurable purpose. They are modeled in "Behavioral" VHDL and are 
targeted to be implemented as ASICs. This inherently would lead to a reduced number of 
configurations. 

The Time taken to execute each application individually will be compared to available 
estimates obtained for matching area and clock specifications from work carried out by other 
researchers. 
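
The swap-time measure in step 9 (number of configuration bits to be loaded) can be sketched as follows. This is a hedged illustration only: the dictionary layout (switch-box address mapped to a bit string) and the assumption that RAM-style addressing lets unchanged switch boxes be skipped are ours, not part of the proposal.

```python
def swap_cost_bits(old_config, new_config):
    """Count the configuration bits loaded when swapping applications.

    Each configuration maps a switch-box address to its bit string.
    """
    bits = 0
    for address, new_bits in new_config.items():
        # Assumption: a switch box whose bits are unchanged is not reloaded.
        if old_config.get(address) != new_bits:
            bits += len(new_bits)
    return bits
```

Under this model, the more similar the successor circuit is to its predecessor, the fewer bits are loaded, which matches the dependence on similarity described above.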

4 The Methodology 

This section gives a detailed description of the 3 primary research problems, the 
associated literature survey and the proposed solutions. It starts with a description of the process of 
identifying reconfigurable regions in the applications, followed by the technique to determine the 
number of processing units for high resource utilization, and lastly discusses the aspects of 
scheduling the tasks onto the processing platform. 

4.1 Identification of Reconfigurable Clusters 

A Control Data Flow Graph consists of both data flow and control flow portions. In compiler 
terminology, all regions in a code that lie in between branch points are referred to as Basic 
Blocks. In order to identify regions within a CDFG and those between CDFGs, that have nearly 
identical control with embedded data flow structures, it is necessary to examine the possibilities 
of graph homomorphism beyond a basic block. But since CDFGs are CFGs with embedded 
DFGs, where a basic block can be interpreted as a DFG, it is first useful to perform a match 
between the DFGs constituting a CFG. Code movement can also be performed to result in 
potential speed ups and identification of larger reconfigurable regions (modified DFGs). But this 
process can be done in a second pass if there is potential for such speedups and reconfigurations 
in the applications being considered. The DFG matches are found using the graph matching 
algorithm described in the following section. Once a potential match is found at the level of a 
DFG or a modified DFG, groups of such DFGs constituting abstract structures such as a 
hammock structure, nested loops etc. are searched for matches. A successful match at this level 
would indicate homomorphism at both the data and control flow levels of the graph. Since 
matches are investigated at levels of basic blocks, potentially modified basic blocks, and abstract 
control data flow structures, regions which are not simple basic blocks, for a generic purpose 
shall be denoted uniformly as zones. Details on the algorithms used to extract zones and identify 
control structures to form CFGs from an assembly file are given in appendix A. We have used 
the industry standard gcc compiler with the following optimizations set (redundant expression 
elimination, constant propagation, copy propagation, constant folding, dead code elimination, 
scalarization, local register allocation, global register allocation, register targeting, call in-lining, 
code hoisting and sinking). 

4.1.1 Graph Matching 

We will first provide a brief introduction to the area of graph matching and then provide a 
literature survey of existing techniques. Then the proposed techniques will be discussed along with 
their merits and demerits. 

4.1.1.1 Basic Definitions and Terminology 

A graph $G = (V, E)$ in its basic form is composed of vertices and edges. $V$ is the set of 
vertices (also called nodes or points) and $E \subseteq V \times V$ is the set of edges (also known as arcs 
or lines) of graph $G$. The order (or size) of a graph $G$ is defined as the number of vertices of 
$G$; it is represented as $|V|$, and the number of edges as $|E|$. 
If two vertices in $G$, say $u, v \in V$, are connected by an edge $e \in E$, this is denoted by 
$e = (u, v)$ and the two vertices are said to be adjacent or neighbors. Edges are said to be 
undirected when they have no direction, and a graph $G$ containing only such types of edges 
is called undirected. When all edges have directions and therefore $(u, v)$ and $(v, u)$ can be 
distinguished, the graph is said to be directed. 

In addition, a directed graph $G = (V, E)$ is called complete when there is always an edge 
$(u, u') \in E = V \times V$ between any two vertices $u, u'$ in the graph. 
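
These definitions translate directly into a few lines of code. The sketch below is illustrative only; the function name and the representation of a graph as a vertex list plus a list of ordered pairs are assumptions made for the example.

```python
def is_complete_directed(vertices, edges):
    """True if E = V x V, i.e. every ordered pair of vertices is an edge."""
    edge_set = set(edges)
    return all((u, v) in edge_set for u in vertices for v in vertices)


# A small directed graph G = (V, E); (u, v) and (v, u) are distinct edges.
V = ["u", "v"]
E = [("u", "v")]
order = len(V)         # |V|, the order of the graph
edge_count = len(E)    # |E|
```

With the definition $E = V \times V$ as written, completeness also requires the pairs $(u, u)$, which is what the check above enforces.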

Exact and inexact graph matching 

The graph matching problem can be stated as follows: 

Given two graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$, with $|V_1| = |V_2|$, the problem is to find a 
one-to-one mapping $f : V_1 \to V_2$ such that $(u, v) \in E_1$ iff $(f(u), f(v)) \in E_2$. When such a 
mapping $f$ exists, this is called an isomorphism, and $G_1$ is said to be isomorphic to $G_2$. This 
type of problem is considered to be exact graph matching. 
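
For very small graphs the definition of exact matching can be checked by brute force. The sketch below tries every one-to-one mapping and is meant purely as an illustration of the definition; the function name and graph representation are assumptions, and the factorial running time makes it unsuitable for real CDFGs.

```python
from itertools import permutations


def find_isomorphism(v1, e1, v2, e2):
    """Return a dict f: V1 -> V2 with (u, v) in E1 iff (f(u), f(v)) in E2, else None."""
    e1, e2 = set(e1), set(e2)
    if len(v1) != len(v2) or len(e1) != len(e2):
        return None
    for image in permutations(v2):
        f = dict(zip(v1, image))
        # The "iff" condition must hold for every ordered pair of vertices.
        if all(((u, v) in e1) == ((f[u], f[v]) in e2) for u in v1 for v in v1):
            return f
    return None
```

For instance, the path 1 -> 2 -> 3 is isomorphic to c -> a -> b, but not to the fan-out a -> b, a -> c.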

When an exact match cannot be found between two graphs (for instance, if the numbers of 
vertices are different), then finding the best matching between them is called 
homomorphism. In this case, the matching aims at finding a non-bijective correspondence 
between a data graph and a model graph. In a homomorphic graph matching problem, if we 
assume $|V_1| < |V_2|$, the goal is to find a mapping $f' : V_2 \to V_1$ such that 
$(u, v) \in E_2$ iff $(f'(u), f'(v)) \in E_1$. This corresponds to the search for a small graph within a big 
one. An important sub-type of these problems is the sub-graph matching problem, in which 
we have two graphs $G = (V, E)$ and $G' = (V', E')$, where $V' \subseteq V$ and $E' \subseteq E$, and in this 
case the aim is to find a mapping $f' : V' \to V$ such that $(u, v) \in E'$ iff $(f'(u), f'(v)) \in E$. When 
such a mapping exists, this is called a subgraph matching or subgraph isomorphism. 
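
The sub-graph matching definition can likewise be illustrated by exhaustive search over injective mappings from the small graph into the big one. This is a sketch of the definition only, with hypothetical names, and it is exponential in the size of the small graph.

```python
from itertools import permutations


def find_subgraph_match(small_v, small_e, big_v, big_e):
    """Return an injective dict f': V' -> V with (u, v) in E' iff (f'(u), f'(v)) in E."""
    small_e, big_e = set(small_e), set(big_e)
    # Every ordered selection of |V'| distinct big-graph vertices is a candidate image.
    for image in permutations(big_v, len(small_v)):
        f = dict(zip(small_v, image))
        if all(((u, v) in small_e) == ((f[u], f[v]) in big_e)
               for u in small_v for v in small_v):
            return f
    return None
```

A single edge x -> y is found inside the path 1 -> 2 -> 3 -> 4, whereas a directed 3-cycle is not.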

4.1.1.2 Literature Survey 

Exact graph matching: graph isomorphism 

This category of graph matching problems has not yet been classified within a particular 
type of complexity such as P or NP-complete. Some papers in the literature have tried to 
prove its NP-completeness when the two graphs to be matched are of particular types or 
satisfy some particular constraints [39, 40], but it still remains to be proved that the 
complexity of the whole type remains within NP-completeness. On the other hand, for some 
types of graphs the complexity of the graph isomorphism problem has been proved to be of 
polynomial type. An example is the graph isomorphism of planar graphs, which has been 
proven [41] to be of polynomial complexity. 

Exact sub-graph matching or sub-graph isomorphism: 

This particular type of graph matching problem has been proven to be NP-complete [40]. 
However, some specific types of graphs can also have a lower complexity. For instance, the 
particular case in which the big graph is a forest and the small one to be matched is a tree 
has been shown to be of polynomial complexity [40, 42]. 

Inexact graph matching: graph and sub-graph homomorphisms: 

In inexact graph matching, where we have $|V_1| < |V_2|$, the complexity is proved in [43] to be 
NP-complete. Similarly, the inexact sub-graph problem is equivalent in 
complexity to the largest common sub-graph problem, which is also known to be NP-complete. 

Note: The problem of trying to find clusters in CDFGs lies in this category. 
Some graph matching problems are based on the idea of having more than one model, and 
on performing graph matching to a database of models so that the model that best 
approaches the characteristics of the data graph is selected. Therefore, the aim here is to 
recognize a model rather than going deeply to recognize each of the segments of the data 
image. An illustrative example is found in [44] in which decision trees are used for graph 
and subgraph isomorphism detection in order to match a graph to the best of a dictionary of 
graphs. [45] proposed the method of node growing for graph searches in databases. This 
involves the formation of a pool of 2 node graphs. The candidates for the nodes in this pool 
are all possible nodes from the database of graphs to be searched. Each of these small 
graphs is compared with every candidate graph in the database. The bottom x % of the 
matches are then pruned. The remaining 2 node graphs are now grown into 3 node graphs. 
Every possibility is grown. Then the process of comparison is repeated and pruning carried 
out. The match between 2 graphs is done through comparison of the canonical labels of 
their adjacency matrices. Various canonification functions can be used to derive the labels. 
This method uses the longest possible label. But the drawback of this approach is the lack 
of weights to the edges. Even without weights complex partitioning schemes are applied to 
the adjacency matrices to obtain labels for comparison. Others who have worked on 
template matching based clustering include [46, 47, 48, 49]. 
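
The idea of comparing graphs through canonical labels of their adjacency matrices can be illustrated with a brute-force sketch. This is not the partitioning scheme of [45]; it assumes one simple canonification function (the lexicographically largest row-major adjacency string over all vertex orderings), and the function name is hypothetical.

```python
from itertools import permutations


def canonical_label(n, edges):
    """Largest row-major adjacency string over all orderings of vertices 0..n-1."""
    edge_set = set(edges)
    best = ""
    for order in permutations(range(n)):
        # Flatten the adjacency matrix induced by this vertex ordering.
        label = "".join("1" if (u, v) in edge_set else "0"
                        for u in order for v in order)
        best = max(best, label)
    return best
```

Two unweighted directed graphs on the same number of vertices receive the same label exactly when they are isomorphic, so graph comparison reduces to string comparison, at the price of enumerating all vertex orderings.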

Some others have reduced the largest common subgraph problem to a 
combinatorial optimization problem. For example in the Graduated Assignment method 
[50] a match between 2 graphs is obtained by formulating the differences between the 
weighted graphs as an objective function. The authors then try to minimize this objective 
function. 

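$$E_{wg}(M) = -\frac{1}{2} \sum_{a=1}^{A} \sum_{i=1}^{I} \sum_{b=1}^{A} \sum_{j=1}^{I} M_{ai} M_{bj} C_{aibj}$$

subject to $\forall a \; \sum_{i=1}^{I} M_{ai} \le 1$, $\forall i \; \sum_{a=1}^{A} M_{ai} \le 1$, $\forall a, i \; M_{ai} \in \{0, 1\}$ 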

Here $M_{ai}$ and $M_{bj}$ are entries of the same matrix (called the match matrix). The $C$ term is the 
difference in the weights of the 2 edges being compared. The summation is essentially a 
combination of every possible edge comparison between the 2 graphs. So, if there is an 
exact match between 2 edges then the product of the Ms and C will be 1, else it will be 0. 
Hence the minimum value of the summation represents the maximum number of edge 
matches between the 2 graphs. Since a node in a graph can only match up with one node in 
the other graph, the match matrix should be a permutation matrix. In classical combinatorial 
problems, the assignment problem corresponds to finding a permutation matrix for a given 
sample matrix such that the summation of the chosen elements (an element in the sample 
matrix is chosen if its corresponding entry in the permutation matrix is 1) is maximum. The 
authors convert the given minimization problem into a maximization problem by 
expanding the objective function using a Taylor series. They then convert the discrete 
problem into a continuous version by using a control parameter. This is done by producing 
an initial match matrix, obtained by exponentiating the error function. This is then 
subjected to iterative row and column normalization, which should result in a doubly 
stochastic matrix (Sinkhorn's rule). Sinkhorn's rule states that any positive matrix, when 
iteratively normalized along rows and then along columns, will converge to a doubly 
stochastic matrix. A doubly stochastic matrix is a non-negative matrix in which every row and 
every column sums to one. The newly obtained matrix is again reused 
in the objective function with a newly increased parameter value. The complexity of this 
approach is O(lm), where l and m are the number of edges in the 2 candidate graphs. 
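
The row and column normalization step (Sinkhorn's rule) mentioned above can be sketched in a few lines. This is a minimal pure-Python illustration, assuming a strictly positive square matrix and a fixed iteration count chosen for the example; it is not the full Graduated Assignment algorithm of [50].

```python
def sinkhorn(matrix, iterations=100):
    """Alternately normalize the rows and columns of a positive square matrix."""
    m = [row[:] for row in matrix]   # work on a copy
    n = len(m)
    for _ in range(iterations):
        # Row normalization: each row is scaled to sum to one.
        for row in m:
            s = sum(row)
            for j in range(n):
                row[j] /= s
        # Column normalization: each column is scaled to sum to one.
        for j in range(n):
            s = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= s
    return m
```

After enough iterations both the row sums and the column sums approach one, which is the doubly stochastic form that the method then feeds back into the objective function.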
A review on general purpose probabilistic graph matching can be found in [51], where 
different types of probabilistic graphs, different techniques for their manipulation, and 
fitness functions appropriate for these problems are presented. Some have also used 
Fuzzy set theory as a means to create vertex and edge attributes to be applied to graph 
matching. References in the literature using this type of attributes for inexact graph 
matching include [52, 53]. [54] transforms the graph matching problem into the maximum 
clique problem and proposes a generic solution free from application domains. [55] 
proposes the use of a Lagrangian relaxation network to match graphs. 
Most of these approaches suffer from some disadvantages. Iterative methods such as the graduated 
assignment method give no indication as to how fast the objective function 
converges. It is also an approximation approach since it involves maximization of an error 
term. The growing nodes method is very slow and cumbersome. Therefore the following 
method is proposed, which has low complexity and is designed specifically for matching 
control data flow graphs. 

In the following approaches, we start with CDFGs representing the entire application and 
which have been subjected to zone identification, parallelization and loop unrolling. The 
zones / Control Points Embedded Zones (CPEZ) that can be suitable candidates for 
reconfiguration will be tested for configurable components through the following 
approaches. Note: Each Zone / CPEZ will be represented as a graph. 

4.1.1.3 Proposed Approach 

All computations are best analyzed for hardware synthesis by representing the applications 
or algorithms to be implemented, in terms of Control Data Flow Graphs. In the context of 
CDFGs, there are two primary restrictions on the graphs. The first is that, apart from loops, 
no other edge can traverse the graph in a direction against the cycle increment. For example, 
in the graph shown in Figure 10, all edges are directed and flow is from left to right. The 
other restriction on the graphs is that all CDFGs must be developed based on 2 axes: The x- 
axis, representing the cycle info, and the y-axis representing the operation or task being 
performed. By taking advantage of these restrictions, if we can represent graphs in terms of 
strings, then low complexity algorithms for the purpose of string matching can be 
effectively used. It is well known that graphs can be represented through adjacency 
matrices, whose canonical labels can be effectively used as strings. We now have to 
develop strings from these graphs, the CDFGs. We do that by populating adjacency 
matrices such as the one shown in the table of Figure 1. This has an array of indices 
associated with each row index whose row has one or more entries. An entry (or count in 
the table) indicates that an edge exists between the source vertex (indicated by the row index) 
and the destination vertex (indicated by the column index). The array associated with a 
row index holds the columns in that row where a populated count exists. 

Figure 1: A Single Graph with the Adjacency Matrix 

Since all zones in the CDFGs need to be represented in this manner, preparing the adjacency 
matrices cannot be avoided. But while reading out the matrices into canonical labels, all 
elements in the tables need not be checked for Null value. Therefore a string of tasks, 
associated with cycle information is obtained as shown below: 
a1-a2, a1-a2, b1-a2, b1-a2, b1-c2, a2-b4, a2-c3, c2-c3, c3-b4, b4-b5. 
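
A sketch of how such an edge string might be read out of a sparse adjacency structure is given below. It is illustrative only: node labels are assumed to be strings such as "a1" (operation letter followed by cycle number), the function names are hypothetical, and the read-out order (lexicographic by label) is an arbitrary choice, since the string is sorted afterwards in any case.

```python
from collections import defaultdict


def adjacency_with_index(edges):
    """Build edge counts plus, per row, the list of populated columns."""
    counts = defaultdict(lambda: defaultdict(int))
    for src, dst in edges:
        counts[src][dst] += 1
    # The index array of a row holds only the columns that carry a count.
    index = {src: sorted(cols) for src, cols in counts.items()}
    return counts, index


def read_out(counts, index):
    """Emit one 'src-dst' label per edge, visiting only populated entries."""
    labels = []
    for src in sorted(index):
        for dst in index[src]:
            labels.extend([f"{src}-{dst}"] * counts[src][dst])
    return labels
```

Because only the columns listed in each row's index array are visited, empty (Null) entries of the matrix are never inspected during read-out.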

The string of edges (elements) obtained this way will now be sorted using an efficient 
algorithm such as Merge Sort whose complexity is O(n log n). Prior to sorting, all cycle info 
is hidden. It must be noted that since multiple nodes can exist at a particular cycle, a 
mechanism is required to distinguish between cases of multiple fan-ins or fan-outs to a node 
and multiple nodes of identical type with lower number of fan-ins or fan-outs. For example, 
in the figure shown below (Figure 2), the two cases should not be encoded identically. 
Therefore to distinguish between the two cases, a cycle is split into as many 
stages as dictated by the maximum number of node copies of a particular type. In the example shown 
on the right, node b (op2) in cycle 4 has 3 copies, so cycle 4 is split into 3 stages. The nodes 
are then spread out among these stages. The cycles are then renumbered increasingly. 



Figure 2: Cycle Splitting and renumbering 

This helps in uniquely identifying the labels for the edges and nodes. The cycle information 
of the nodes is used while identifying a match between edges in similar bin pairs as 
explained below. 
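
One possible reading of the cycle splitting and renumbering step is sketched below. The details are assumptions made for illustration: nodes are given as (node_id, operation, cycle) triples, the k-th copy of an operation type within a cycle is placed in the k-th stage of that cycle, and the renumbering starts at 1.

```python
from collections import Counter, defaultdict


def split_cycles(nodes):
    """Split each cycle into stages and renumber; returns {node_id: new_cycle}."""
    by_cycle = defaultdict(list)
    for node_id, op, cycle in nodes:
        by_cycle[cycle].append((node_id, op))
    new_cycle = {}
    next_cycle = 1
    for cycle in sorted(by_cycle):
        seen = Counter()
        for node_id, op in by_cycle[cycle]:
            # The k-th copy of an operation type goes into stage k of this cycle.
            new_cycle[node_id] = next_cycle + seen[op]
            seen[op] += 1
        # The cycle occupies as many stages as its most duplicated operation type.
        next_cycle += max(seen.values())
    return new_cycle
```

With three copies of operation b in one cycle, that cycle expands into three consecutive stages, and every later cycle is shifted accordingly.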

After hiding the cycle information, this process will result in the following string sequence 
being generated: 

aa, aa, ab, ac, ba, ba, bb, bc, cb, cc 

The criterion for sorting is based on the fact that an edge consists of 4 basic elements: Source 
Clock, Source Operation (SO), Destination Clock and Destination Operation (DO). If the Clock 
information is now hidden, then the SO of two edges are compared and the one with the 
lower rank is placed to the left. In the example shown, source operation 'a' has a lower rank 
than 'b' and 'c'. If the SO of the edges are the same, then their DO are compared. The same 
rule applies; the DO with the lower rank is placed to the left. In this manner, the string is 
sorted. Now these pairs of letters will be placed into bins. In order to place them, the first 
or the left most pair (aa in our example) is assumed to be the head of the queue. It is placed 
in the first bin. Then all the following elements in the queue are compared with the head, till 
a mismatch is obtained. If a match occurs then, that pair is placed in the same bin as the 
head. Now the first mismatched pair is designated as new head of the queue. This is now 
placed in a new bin and the process is followed till all elements are in a set of bins as shown 
in Figure 3. 

Figure 3: First Graph's edges arranged into a Bin Sequence 
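
The sorting and binning of the first graph's edges can be sketched as follows. The sketch assumes that each edge has already been reduced to a two-letter SO/DO string with the cycle information hidden, that the rank of an operation is its alphabetical order, and that the ten-edge example includes an a-to-b edge; the function name is hypothetical.

```python
from itertools import groupby


def bin_edges(edge_types):
    """Sort two-letter SO/DO strings and group equal ones into ordered bins."""
    # String comparison orders by SO first and by DO on ties; Python's
    # built-in sort is O(n log n), like the merge sort named in the text.
    ordered = sorted(edge_types)
    # Consecutive equal elements form one bin, led by the head of the queue.
    return [(key, len(list(group))) for key, group in groupby(ordered)]


graph1_bins = bin_edges(["aa", "aa", "ba", "ba", "bc", "ab", "ac", "cc", "cb", "bb"])
```

Each bin is represented here simply by its edge type and the number of edges it holds; a fuller implementation would keep the edges themselves so that their cycle information is available in the later matching step.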

The next step is to perform a similar but not exactly the same process for the graph that needs 
to be compared with the candidate graph, graph #1. 
Consider a second graph, graph #2 as shown in Figure 4. 

Figure 4: The Second Graph (axes: operation versus cycle) 

This graph is converted to a string format in the same manner as graph #1, and this string, as 
shown below, needs to be placed into a new set of bins: 
aa, ab, ab, ba, ba, bb, bb, bc, cb, cc 

This is done by assigning the leftmost element in the queue to be the head. It is first 
compared to the element type in the first bin of the old set (aa) [this is termed the 
reference bin]. If it is found to be the same, then the first bin of the new set is created and all 
elements up to the first mismatch are placed in this bin. Then the reference bin is marked as 
checked. Now the new head type is compared to the first unchecked bin of the reference set. 
If there is a mismatch, then the comparison is done with the next unchecked bin and so on, 
until the SO of the element type is different from the SO of the element type in the reference 
bin. At this point, all successive elements in the current queue are 
compared with the head, till a mismatch is met. Then the matched elements are eliminated. 
But in case a match is found between the head of the queue and a reference bin, a new bin 
in the current set is created and suitably populated. The corresponding reference bin is 
marked as checked and all preceding unchecked reference set bins are eliminated. 
By this approach, we are eliminating comparison between unnecessary edges in the graphs. 
Now a new set of bins for graph 2 is obtained as shown in Figure 5: 

Figure 5: The Second Graph arranged in a Modified Bin Sequence 
In the first bin set, the one containing 'ac' is eliminated. 
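
The binning of the two graphs can be sketched in a few lines of Python. This is a simplified
illustration, not the implementation used in this work: it assumes that both edge strings are
already sorted, so that the scan over unchecked reference bins reduces to a merge of two
sorted label sequences, and it stores only a count per bin. The function names and both
edge strings are illustrative assumptions.

```python
from itertools import groupby

def to_bins(edge_labels):
    """Group consecutive equal labels of a sorted edge string into (label, count) bins."""
    return [(label, len(list(group))) for label, group in groupby(edge_labels)]

def modified_bins(reference_bins, edge_labels):
    """Bin a second edge string against the reference bins of the first graph.

    Returns (new_bins, eliminated_reference_labels, eliminated_edges).
    Assumes both sequences are sorted in the same (lexicographic) order.
    """
    ref_labels = [label for label, _ in reference_bins]
    new_bins, eliminated_ref, eliminated_edges = [], [], []
    i = 0  # index of the first unchecked reference bin
    for label, count in to_bins(edge_labels):
        # Reference bins that sort before the head can no longer match anything.
        while i < len(ref_labels) and ref_labels[i] < label:
            eliminated_ref.append(ref_labels[i])
            i += 1
        if i < len(ref_labels) and ref_labels[i] == label:
            new_bins.append((label, count))   # match: create and populate a new bin
            i += 1                            # reference bin is now checked
        else:
            eliminated_edges.append((label, count))  # no counterpart in graph #1
    eliminated_ref.extend(ref_labels[i:])
    return new_bins, eliminated_ref, eliminated_edges

# Illustrative edge strings (assumed, already sorted).
graph1 = ["aa", "ab", "ac", "ba", "bb", "bb", "bc", "cb", "cc"]
graph2 = ["aa", "ab", "ab", "ba", "ba", "bb", "bb", "bc", "cb", "cc"]

new_bins, eliminated_ref, eliminated_edges = modified_bins(to_bins(graph1), graph2)
```

With these assumed strings, the reference bin 'ac' has no counterpart in the second string and
is the only bin eliminated from the first set.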

To find the largest common sub-graph, we operate on corresponding pairs of bins
(bins with the same SODO) from both queues, where both bins contain elements. While
comparing edges in a bin pair, the cycle information is now used. To establish a match
between source-to-destination node structures, we consider all fan-ins to a destination
node. That is, if an edge in one bin matches an edge in the corresponding bin of the
pair, then all other fan-ins to the first edge's destination node are compared with all
fan-ins to the second edge's destination node. Only if a match is found for all the
fan-ins are the two structures classified as a match. These edges are then cancelled
out of the bin sequence. In this way all edges are eventually exhausted from the sequence.
The set of matched edges constitutes the largest common sub-graph (LCSG). While performing
a match, all properties of edges and nodes have to be considered: cycle information, type
of data transaction (fixed or floating point), and bit precision / data width (1, 2, 4,
8 bits, etc.). Once an LCSG has been extracted from a pair of basic blocks, it is henceforth
identified as a node with a unique label, and matches are then searched for in the governing
control structure of these basic blocks. This leads to the search for abstract data flow
embedded control structures. To perform the match among a group of basic blocks
interconnected through control flow edges, appropriate labels are given to the LCSGs
obtained from basic blocks, modified basic blocks and zones.
The advantage of this approach is that a fan-in structure (multiple source nodes feeding a
destination node) is tested for a match only when a potential edge match arises from a bin
pair. At the first mismatch in an edge of the primary fan-in structure, any further search
for matching edges in the secondary structure is abandoned, and the search resumes with the
next edge in the secondary bin pair. If no matching structure is found, all edges
corresponding to the first structure are eliminated from the primary bin sequence. This
approach takes advantage of the nature of graphs that represent instruction sequences for
computational problems. Analyzing instruction sequences with simpler data structures such as
doubly linked lists makes deriving common and configurable clusters very difficult. It is
also not possible to extract such common sub-graphs with template matching methods, because
they rely heavily on the initial templates, and growing templates beyond 8-10 vertices
makes the problem quite complex.
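
The fan-in test described above can be illustrated with a short Python sketch. It is a
deliberately reduced version of the comparison: only node types are compared, whereas the
text requires cycle information, data transaction type and bit width to match as well. The
graph representation (a list of edges plus a node-type dictionary) and all names are
assumptions made for the illustration.

```python
from collections import Counter

def fan_in_signature(edges, node_type, dst):
    """Multiset of the types of all source nodes feeding destination node `dst`."""
    return Counter(node_type[src] for src, d in edges if d == dst)

def structures_match(graph1, dst1, graph2, dst2):
    """True if the fan-in structures ending at dst1 and dst2 agree in every fan-in."""
    edges1, types1 = graph1
    edges2, types2 = graph2
    if types1[dst1] != types2[dst2]:
        return False
    return (fan_in_signature(edges1, types1, dst1)
            == fan_in_signature(edges2, types2, dst2))

# Hypothetical graphs: (edge list, node -> operation type).
g1 = ([(1, 3), (2, 3)], {1: "a", 2: "b", 3: "c"})
g2 = ([(7, 9), (8, 9)], {7: "b", 8: "a", 9: "c"})
g3 = ([(7, 9)], {7: "b", 9: "c"})

same = structures_match(g1, 3, g2, 9)       # both fan-ins of type a and b into c
different = structures_match(g1, 3, g3, 9)  # g3 lacks the fan-in of type a
```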

Therefore, after subjecting the zones / CPEZs to the cluster (Largest Common Sub-graph)
extraction process, the CDFG representing the application as a whole will consist of 2 types
of entities: clusters and non-clusterized zones (or CPEZs). For the purpose of scheduling,
they will be termed 'Processes'. For the purpose of implementation in VHDL, we will
model the clusterized processes in 'Behavioral' form. The non-clusterized zones will be
implemented on generic LUTs, modeled in 'Behavioral & Structural' style VHDL. An
example of such a CDFG is shown in Figure 6, where processes belonging to cluster types 1
and 2 will be mapped onto processing units specialized to execute them, whereas
non-clusterized processes, indicated in blue, will be mapped onto generic LUT structures.




Figure 6: An Example CDFG 
4.2 Number of Processing units for a Cluster Type 

Once the clusters have been identified, it is necessary to choose judiciously the number of 
processing units necessary to execute jobs (clusters) belonging to the same type. This process 
determines the amount of spatial parallelism that the reconfigurable processor can offer. To 
maximize the utilization of each processing unit, the following divide and conquer approach is 
followed. 

4.2.1 Partitioning by Divide & Conquer Approach 

Let the number of resources allocated for processes of type PEi be Ni. In Figure 6, the
following configuration has been assumed. There are three processors PE1, PE2 and PE3, with
N1 = N2 = N3 = 1, one for each type of process.

Determination of the number of resources for each type of process:

Obtain the sub-graphs for every possible path. For example, the graph in Figure 6 is used to
obtain the path for DCK (sub-graph shown in Figure 7).

Figure 7: Sub-graph for condition DCK (legend: source and sink nodes; process types 1, 2 and 3)



For this particular example graph, 6 such sub-graphs are obtained, one for each of the 
conditions: D CK , D CK , DCK , DCK , DC , ~DC . 

For any sub-graph, we can determine the number of units required of each processor type. This
is done by isolating the nodes corresponding to a processor type. For example, in the DCK
example above, 3 more graphs can be obtained, as shown in Figures 8 and 9. It might not
make sense to apply this policy to all 6 sub-graphs, so only those deemed most
likely to be taken should be considered. In case an unlikely path does end up being taken, the
clock speed for the general purpose computing resources (the programmable LUTs) must be
designed suitably if real-time requirements exist.

Figure 9: Graphs obtained for Individual Process type 3

For a graph with only a specific process type highlighted, we obtain the number of processor
units by identifying the critical path and separating the graph into processes on the critical
path and those outside it. For example, if we had a graph as shown in Figure 10, with the
critical path marked in dotted line arrows, we would group processes P1, P5, P8, P9, P10,
P12 and P13 into the primary group, and place all the other processes into another
group called the secondary group. If the combined execution time in the primary group is
Tp and the combined execution time in the secondary group is Ts, we check the ratio
Tp : Ts. If the ratio is close to 1:1, then most likely the maximum benefit can
be obtained by scheduling the two groups onto 2 parallel processors. If the ratio is 1:x
where x > 1, then a critical path is identified within the secondary group, and the
secondary group is similarly split into 2 groups. We proceed in this divide and conquer
manner until a ratio of 1:1:1... or close to it is obtained. But if Tp : Ts is x : 1, then
resources would be underutilized if additional processing units were allocated. In this case
we might be better off using a single resource allocation.
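
A minimal Python sketch of this divide and conquer estimate is given here. It is an
illustration under stated assumptions rather than the procedure as implemented: the graph is
taken to be a DAG given by successor lists, "close to 1:1" is decided by a hypothetical
`tolerance` parameter, and the process names and execution times are invented.

```python
def critical_path(times, succ):
    """Longest path by total execution time; returns (length, [nodes])."""
    best = {}

    def longest_from(node):
        if node not in best:
            tails = [longest_from(s) for s in succ.get(node, []) if s in times]
            length, path = max(tails, default=(0, []))
            best[node] = (times[node] + length, [node] + path)
        return best[node]

    return max((longest_from(n) for n in times), default=(0, []))

def count_units(times, succ, tolerance=0.25):
    """Estimate the number of parallel units for one process type."""
    if not times:
        return 0
    t_p, path = critical_path(times, succ)            # primary group
    secondary = {n: t for n, t in times.items() if n not in path}
    t_s = sum(secondary.values())                      # combined secondary time
    if t_s < (1 - tolerance) * t_p:
        return 1       # Tp : Ts is x : 1, a second unit would be underutilized
    if t_s <= (1 + tolerance) * t_p:
        return 2       # close to 1 : 1, one unit per group
    return 1 + count_units(secondary, succ, tolerance)  # 1 : x, split secondary again

# Hypothetical example: three independent chains of equal length.
times = {"P1": 4, "P2": 4, "P3": 4, "P4": 4, "P5": 4, "P6": 4}
succ = {"P1": ["P2"], "P3": ["P4"], "P5": ["P6"]}
units = count_units(times, succ)
```

For the three equal chains assumed above, the split ends at a 1:1:1 ratio and three units are
suggested; a graph dominated by its critical path yields a single unit.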
Figure 10: Divide and Conquer Method of Determining Number of Parallel Units
4.3 Scheduling

Once the number of processing units has been chosen, the CDFGs have to be mapped onto these
units. This involves scheduling, i.e., allocating tasks to the processing units so as to
complete execution of all possible paths in the graphs with the least wastage of resources while
avoiding conflicts due to data and resource dependencies. In this section we first present a
literature survey, followed by the proposed scheduling strategy.

4.3.1 Literature survey 

We now address the issue of task scheduling. In the graph matching problem, we can include
branch operations to reduce the number of graphs. This can be done if one of the paths of a
branch operation leads to a very large graph compared to the other path, or is a subset of the
other path. This still leaves us with the problem of conditional task scheduling with loops
involved. Since scheduling is applicable to many diverse areas of research, we will not
discuss all the work done in scheduling in this section. Instead we focus on work relevant
to mapping data flow graphs onto processors, propose a method most suitable for the
purpose of reconfiguration, and compare it with contemporary methods. Several
researchers have addressed task scheduling, and one group has also addressed loop scheduling
with conditional tasks [57]. A detailed survey of data and control dominated scheduling
approaches can be found in [58], [59] and [60]. Chekuri's paper [61] discusses profile driven
scheduling based on the earliest branch node retirement scheme. This is applicable to trees
and s-graphs. An s-graph is a graph where only one path has weighted nodes. In this case, it
is a collection of DAGs representing basic blocks which all end in branch nodes, and the
options at each branch node are to exit from the whole graph or to exit to another branch
node, as shown in Figure 11. The two DAGs are in blue and violet and the branch nodes are in
red, with the probabilities of exiting the system being p and q. Such a graph is scheduled on
a generic set of processing elements such that the branch node with the least ratio of
schedule length to the sum of exit probabilities from the system (up to that branch node) is
scheduled as early as possible. The denominator indicates how many sub-graphs can be
completed. The numerator indicates how fast this can be done, i.e., the least amount of time
it takes. Therefore, a low rank indicates that a large number of sub-graphs can be completed
in a small amount of time. This is a form of list based scheduling that tries to minimize
the expected time of completion.
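
The rank described above can be written down directly. The following Python fragment is only
a sketch of the ratio as paraphrased in this survey, not of the full algorithm in [61]; the
branch-node records and their numbers are hypothetical.

```python
def rank(schedule_length, exit_probabilities):
    """Schedule length divided by the sum of exit probabilities up to the branch node."""
    return schedule_length / sum(exit_probabilities)

def retirement_order(branch_nodes):
    """Branch nodes sorted so that the lowest rank is scheduled earliest."""
    return sorted(branch_nodes, key=lambda b: rank(b["length"], b["exit_probs"]))

# Hypothetical branch nodes: schedule length in cycles, exit probabilities up to the node.
branch_nodes = [
    {"name": "B1", "length": 10, "exit_probs": [0.2]},
    {"name": "B2", "length": 12, "exit_probs": [0.2, 0.6]},
]
order = [b["name"] for b in retirement_order(branch_nodes)]
```

In this assumed example B2 takes slightly longer to reach but retires far more of the
probability mass, so it receives the lower rank and is placed first.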



The problem with this approach is that it is applicable only to 'small' graphs (as defined by
the authors) and is restricted to s-graphs and trees. It also does not consider nodes
mapped to specific processing elements. Others, such as [62] and [63], also consider trace
information and string together many basic blocks. The approaches taken by [64] provide
hardware support for branch prediction. Work on scheduling for conditional flow graphs
includes [65] and [66,67]. Jha's paper [57] addresses scheduling of
loops with conditional paths inside them. This is a good approach, as it exploits parallelism
to a large extent and uses loop unrolling. Its drawback is that the control mechanism needed
to know which iteration's data is being processed by which resource is very
complicated. The approach is useful for one or two levels of loop unrolling, and where the
processing units can afford to communicate often with each other and with the scheduler.
But in our case the network occupies about 70% of the chip area [1], and hence the processing
units cannot afford to communicate with each other too often. Moreover, the granularity of
operation between processing elements is beyond the basic block level, and hence this method
is not practical. Within a processing element, since the reconfiguration distance (edit
distance) is more important, fine scale scheduling is compromised, because the benefit of
using very fine grain processing units is lost to the high configuration load time.

Figure 11: Early Retirement Schedule

The paper [68] discusses a 'path based edge activation' scheme. This means that if, for a
group of nodes (which must be scheduled onto the same processing unit and whose schedules are
affected by branch paths occurring at a later stage), we know the branch controlling values
ahead of time, then we can prepare at run time all possible optimized list schedules for every
possible set of branch controller values. In the simple example shown in Figure 12, the nodes
in gray need to be scheduled on the same processing unit. The branch controlling variable is
b, which can take the value 0 or 1. If it takes the value 0, the branch path in red is taken;
otherwise the path in green is taken. If the value of 'b' can be known at run time, yet ahead
of the occurrence of the branch paths, we can prepare both schedules for the 3 gray
nodes and launch the appropriate one the moment the value of b is known.




Figure 12: Path based edge activation 

This method is very similar to the partial critical path based method proposed by [69]. It
involves the use of a hardware scheduler and is quite well suited for our application. But we
need to add another constraint to the scheduling: the amount of reconfiguration, or the edit
distance. In [69] the authors tackle control task scheduling in 2 ways. The first is partial
critical path based scheduling, which is discussed above, although they do not assume that
the value of the conditional controller is known prior to the evaluation of the branch
operation. They also propose the use of a branch and bound technique for finding a schedule
for every possible branch outcome. This is quite exhaustive, but it provides an optimal
schedule. Once all possible schedules have been obtained, the schedules are merged. The
advantage is that the result is optimal, but it has the drawback of being quite complex. It
also does not consider loop structures. Other papers that discuss scheduling onto
multiprocessor systems include [70], [71] and [72]. Other works on static scheduling
([73] and [74]) involve linearization of the data flow graphs. Some others have taken
fuzzy approaches [75] and [76].

4.3.2 Proposed approach

Given a control-data flow graph, we need to arrive at an optimal schedule. Section 4.3.2.1
explains the CDFG. This is followed by the methodology to obtain near-optimal schedules,
which involves a brief discussion of the PCP scheduling strategy followed by an enhancement
of this approach to arrive at a schedule closer to optimal. Section 4.3.2.3 explains the need
to include reconfiguration time as additional edges in the CDFG. Section 4.3.2.4 discusses
ways to handle loops embedded with mutually exclusive paths and loops with unknown
execution cycles.

4.3.2.1 Control-Data Flow Graph

A directed cyclic graph is used to model the entire application. It is a polar graph
with both source and sink nodes. The graph is denoted by G(V, E), where V is the list of all
processes that need to be scheduled and E is the list of all possible interactions between
the processes. The processes can be of three types: data, communication and reconfiguration.
The edges can be of three types: unconditional, conditional. A simple example with no
loops is shown in Figure 13.




Figure 13: An Example of a Control Data Flow Graph

In the above graph, each of the circles represents a process. Sufficient resources are
assumed for communication purposes. All the processes have an execution time associated
with them, which is shown alongside each circle. If a process is a control-based
process, the various values to which the condition evaluates are shown on the edges
emanating from that process circle.

4.3.2.2 Methodology 

i. Use PCP scheduling to determine the delays for each possible path of the CDFG and
arrange the list of paths in descending order of the delays.

ii. Perform branch and bound based scheduling (which need not be done for every path to 
reduce the complexity). 

iii. Once the final list of all schedules is ready, merge all the schedules by respecting data 
and resource dependencies. 

PCP scheduling : 

PCP is a modified list-based scheduling algorithm. The basic concept of partial critical
path based scheduling can be seen in the situation shown in Figure 14,
where processes $P_A$, $P_B$, $P_X$ and $P_Y$ are all to be mapped onto the same resource,
say Processor Type 1. $P_A$ and $P_B$ are in the ready list, and a decision must be taken as
to which is scheduled first. $\lambda_A$ and $\lambda_B$ are the execution times of the
processes in the paths following $P_A$ and $P_B$ respectively, which are not allocated to
processors of type 1 and also do not share the same type of resource.




Figure 14: PCP based Scheduling

If $P_A$ is assigned first, the longest time of execution is given by
$\max(T_A + \lambda_A,\; T_A + T_B + \lambda_B)$.

If $P_B$ is assigned first, the longest time of execution is given by
$\max(T_B + \lambda_B,\; T_B + T_A + \lambda_A)$.

The best schedule is the minimum of the two quantities. This is called the partial critical
path method because it focuses on the path time of the processes beyond those in the ready
list. Therefore, if $\lambda_A$ is larger than $\lambda_B$, a better schedule is obtained if
process A is scheduled first. But this does not consider the possibility of resource sharing
between the processes in the paths beyond those in the ready list. A simple example
(Figure 15) shows that if $T_A = 3$, $T_B = 2$, $\lambda_A = 7$ and $\lambda_B = 5$, and the
processes in the $\lambda_A$ and $\lambda_B$ sections share the same resource,
say Processor type 2, then scheduling process A first gives a time of 15 and scheduling B
first gives a time of 14. Both the critical path method and PCP as proposed by Pop, however,
suggest scheduling A first.
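
The two cases can be checked numerically with a small Python sketch. The first function is the
PCP expression for independent tails; the second is our own simplified model of the shared
case, assuming that each tail starts as soon as both its predecessor and the single shared
resource are free. The function names are hypothetical.

```python
def pcp_makespan(t_first, lam_first, t_second, lam_second):
    """Completion time when the two tails run on independent resources."""
    return max(t_first + lam_first, t_first + t_second + lam_second)

def shared_tail_makespan(t_first, lam_first, t_second, lam_second):
    """Completion time when both tails compete for one shared resource.

    The tail of the first process starts right after it; the tail of the second
    must wait for both its own process and the shared resource.
    """
    first_tail_end = t_first + lam_first
    second_ready = t_first + t_second
    return max(first_tail_end, second_ready) + lam_second

T_A, T_B, LAM_A, LAM_B = 3, 2, 7, 5

independent = {
    "A first": pcp_makespan(T_A, LAM_A, T_B, LAM_B),
    "B first": pcp_makespan(T_B, LAM_B, T_A, LAM_A),
}
shared = {
    "A first": shared_tail_makespan(T_A, LAM_A, T_B, LAM_B),
    "B first": shared_tail_makespan(T_B, LAM_B, T_A, LAM_A),
}
```

With independent tails the sketch gives 10 cycles for A first and 12 for B first, so PCP
prefers A; with a shared tail resource it reproduces the 15 and 14 cycles quoted above, so B
first is the better order.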



Figure 15: PCP Scheduling with Resource Dependencies in the Partial Path Region 

The difference arises because, if the resource constraint of the processes beyond the ready
list is considered, the best schedule is the minimum of 2 max quantities:
$\max(T_B, \lambda_A)$ and $\max(T_A, \lambda_B)$.

Pop [69] uses the heuristic obtained from PCP scheduling to bound the schedules in a
typical branch and bound algorithm to reach the optimal schedule. But branch and bound
is an exponentially complex algorithm in the worst case, so there is a need for a
less complex algorithm that can produce near-optimal schedules. From a higher-level view
of scheduling, we need to limit the need for branch and bound (BB) scheduling as much as
possible.

Initially, the control variables in the CDFG are extracted. Let $c_1, c_2, \ldots, c_n$ be
the control variables. Then there will be at most $2^n$ possible data-flow paths of execution
from the given CDFG, one for each combination of these control variables. The ideal aim is to
obtain the optimal schedule at compile time for each of these paths. Since the control
information is not available at compile time, we need to arrive at an optimal solution for
each path with every other path in mind. This optimal schedule is arrived at in two stages.
First the optimal individual schedule for each path is determined. Then each of these optimal
schedules is modified with the help of the other schedules.

Stage 1: There are $m = 2^n$ possible Data Flow Graphs (DFGs). For each DFG, PCP
scheduling is done. Then the DFGs are arranged in decreasing order of their total
delays. An optimal solution can be obtained by doing branch and bound scheduling for
each of these PCP scheduled DFGs. But branch and bound is a highly complex algorithm
with exponential complexity. Here this complex operation would need to be done $2^n$
times, where n is the number of control variables, which increases the complexity way
beyond control. Hence branch and bound is done only when it is essential to do so. BB
scheduling is first done for DFG1, which has the largest delay. For DFG2, the PCP delay is
compared with the BB delay of DFG1. If the PCP delay is smaller, the PCP schedule
is taken as the optimal schedule for that path. If not, BB scheduling is done to get
the optimal schedule. It makes sense to do this, as the final delay of each DFG after
modification is going to be close to the delay of the worst delay path. In the same way, the
optimal schedule is arrived at for each of the DFGs.
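
The selection rule of Stage 1 can be summarized in a short Python sketch. The two delay
functions stand in for full PCP and branch and bound schedulers and are hypothetical, as are
the DFG names and delay values; only the decision logic follows the description above.

```python
def select_schedules(dfgs, pcp_delay, bb_delay):
    """Decide, per DFG, whether the PCP schedule is kept or BB scheduling is run.

    pcp_delay and bb_delay are callables mapping a DFG to a delay (assumed to
    wrap real schedulers). Returns {dfg: (method, delay)}.
    """
    ordered = sorted(dfgs, key=pcp_delay, reverse=True)  # decreasing PCP delay
    worst = ordered[0]
    reference = bb_delay(worst)                          # BB is always run for DFG1
    chosen = {worst: ("BB", reference)}
    for dfg in ordered[1:]:
        delay = pcp_delay(dfg)
        if delay < reference:
            chosen[dfg] = ("PCP", delay)                 # good enough, skip BB
        else:
            chosen[dfg] = ("BB", bb_delay(dfg))
    return chosen

# Hypothetical delays in cycles.
PCP = {"DFG1": 40, "DFG2": 38, "DFG3": 30}
BB = {"DFG1": 36, "DFG2": 35, "DFG3": 29}
bb_calls = []

def bb(dfg):
    bb_calls.append(dfg)   # record how often the expensive scheduler is needed
    return BB[dfg]

chosen = select_schedules(list(PCP), PCP.get, bb)
```

In this invented example BB scheduling is needed for two of the three DFGs; the third is
accepted with its PCP schedule because its delay is already below the BB delay of DFG1.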

Stage 2: Once the optimal schedules have been arrived at, a schedule table is initialized
with the processes on the rows and the various combinations of control variables on the
columns. A branching tree is also generated, which shows the various control paths. This
contains only the control information of the CDFG. There exists a column in the schedule
table corresponding to each path in this branching tree. The branching tree is shown in
Figure 16. The path corresponding to the maximum delay is taken, and the schedule for that
path is used as the template (DCK*). Now the DCK path is taken and its
schedule is modified according to that of DCK*. This is done for all the paths. The final
schedule table obtained will be the table that resides on the processor.




Figure 16: Branching Tree 



The pseudo code of this process is summarized in Appendix G.

We also observe that processes with large execution times have a greater impact on the
schedule than shorter processes. Hence, we decided to schedule large processes in a
special way. The shorter processes can be scheduled using the PCP scheduling algorithm.
Since PCP scheduling is done for most of the processes, the complexity stays close to
O(N), where N is the number of processes to be scheduled.

a) Identify the first set of processes that need to be scheduled onto the same processor 
which are computationally complex. Let's call them MP1, MP2.... (Macro process 1 
etc.) 

b) Schedule all the processes till these macro processes in the data flow graph using PCP 
scheduling. 

c) Calculate the estimated execution time of the smaller processes to find the start time of
each macro process.

d) Determine the next set of such macro processes in the DFG. Let's call them MP_sub1,
MP_sub2...

e) For processes amidst these two sets of macro processes, PCP scheduling is used. 

f) For processes occurring after the second set of macro processes, the execution times are 
added up to get the total execution time. 

g) Now, determine the order of execution of these processes by estimating the worst-case 
execution time in each case and selecting the best amongst them. 

h) After this scheduling, the block after the second set of macro processes is taken as the
current DFG and steps a-g are repeated.

i) Step h is repeated till the end of DFG is reached. 

Schedule merging : 

In the schedule table, some columns represent paths that are complete and some represent
paths that are not. The incomplete paths can now be referred to as parent paths of possible
complete paths.

In the example shown in Figure 13, we see that for the earliest evaluation of all conditional
variables (viz. D, C, K) it is necessary to evaluate D first, then C and then K. Therefore the
tree of possible paths is as shown in Figure 16. While creating the schedule table,
initially consider only the full possible paths, i.e., the 6 paths listed in Figure 16. Perform
scheduling by the suggested algorithm. This will fill these columns. Then create the
remaining columns of partial paths (i.e., D, ~D, etc.). These are initially just empty
columns. Now if a process has the same start time in multiple columns, push it into the
parent empty column. This approach tries to obtain the worst case delay and merge all paths
to that timeline. Since the template path had the worst case optimal delay, all other full
paths were adjusted to match it. But it is also necessary to consider the probability
of occurrence of all the full paths (6 of them). Then prune out the bottom 10% of the
paths, that is, disregard those full paths whose probability of occurrence is less than a
threshold value when compared to the path with the most probable occurrence.
A path is then selected from the remaining ones whose probability of occurrence is the
highest. This will be the new reference to which all the remaining paths adjust. It is
likely that the chosen full paths and the disregarded full paths share certain partial
paths (parent paths). Therefore, while allocating the start times for the processes that fall
under these shared partial paths, we must allocate them based on the worst (most delay
consuming) disregarded path which needs (shares) these processes. While performing
schedule merging, all data dependencies must be respected.
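
Two of the merging steps, pushing common entries into a parent column and pruning unlikely
full paths, can be sketched in Python as follows. The table layout (a dictionary of columns,
each mapping a process to its start time), the column names and all numbers are assumptions
made for illustration; dependency checks are omitted.

```python
def merge_into_parent(table, children, parent):
    """Move processes with identical start times in every child column to the parent."""
    common = set.intersection(*(set(table[c]) for c in children))
    table.setdefault(parent, {})
    for proc in sorted(common):
        starts = {table[c][proc] for c in children}
        if len(starts) == 1:                  # same start time in all child columns
            table[parent][proc] = starts.pop()
            for c in children:
                del table[c][proc]
    return table

def prune_paths(probability, ratio):
    """Keep full paths whose probability is at least `ratio` times the most probable one."""
    best = max(probability.values())
    return {path for path, p in probability.items() if p >= ratio * best}

# Hypothetical schedule table: two full paths sharing the partial path "D,C".
table = {
    "D,C,K":  {"P1": 0, "P2": 4, "P5": 9},
    "D,C,~K": {"P1": 0, "P2": 4, "P6": 10},
}
merge_into_parent(table, ["D,C,K", "D,C,~K"], "D,C")

kept = prune_paths({"D,C,K": 0.5, "D,C,~K": 0.25, "~D,C": 0.02}, ratio=0.1)
```

After the merge, P1 and P2 reside in the parent column "D,C" and each full path keeps only the
process that is specific to it.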

4.3.2.3 Reconfiguration

In the discussion so far, we have not emphasized the need to include reconfiguration times
in the CDFGs. We now show with an example how this time can influence the tightness of
a schedule. Consider the following task graph (Figure 17).




Figure 17: Influence of Reconfiguration time on Scheduling

In the task graph, say 'a' is a variable that influences the decision on which of the two
mutually exclusive paths (red or green) will be taken, and 'a' is known at run time but
much earlier than 'm' and 'z' have started. Let x, v, z and λ be the times taken by processes
in the event that 'a' forces the red path to be taken. Let processes x, v and z be
mapped onto the same processing unit, with associated reconfiguration times for
swapping between the processes on the unit. Given these circumstances, if run time
scheduling according to [68] is applied, it neglects the reconfiguration times and provides a
schedule of 5 cycles, as shown on the left hand side. But if reconfiguration time is
considered, a schedule more like the one on the right hand side is tighter, with 4
clock cycles. This simple example shows the importance of considering reconfiguration
time in a reconfigurable processor if fast swaps of tasks on the processing units need to be
performed.

Incorporating reconfiguration time into control flow graphs therefore involves the
following steps:

i. Special edges are added to the control flow graphs. These edges exist between a
similar set of processes, which will be executed on the same processor with or without
reconfiguration.

ii. Reconfiguration times affect the worst-case execution time of loopy code. This has
to be taken care of when loopy code is being scheduled.

iii. Care needs to be taken to schedule the transfer of reconfiguration bit-stream from the 
main memory to the processor memory. 

4.3.2.4 Loop-based scheduling

In static scheduling, loops whose iteration counts are not known at compile time impose
scheduling problems on tasks which are data dependent on them, and on those tasks that have
a resource dependency on their processing unit. Therefore, we have considered the cases which
are likely to impact the scheduling to the largest extent and provided solutions.

Case 1: Solitary loops with unknown execution time. Here the problem is that the execution
time of the process is known only after it has finished executing on the processor, so static
scheduling is not possible.

Solution:

(Assumption) Once a unit generates an output, this data is stored in the input buffer of the
consuming / target unit. Consider the following schedule chart (Figure 18). Each row
represents processes scheduled on a unique type of unit (processor). Let P1 be the loopy
process.

Figure 18: Scheduled Process Charts with Resource and Data Dependency



From the above figure we see that

P3 depends on P1 and P4,

P2 depends on P1,

P6 depends on P2 and P5.

If P1's lifetime exceeds the assumed lifetime (the most probable lifetime or a unit iteration),
then all dependents of P1 and their dependents (both resource and data) should be notified
and the respective Network Schedule Manager (NSM) and Logic Schedule Manager (LSM)
entries delayed. This implies that 2 assumptions are made while preparing the schedule
tables.

1) The lifetimes of solitary loops with unknown execution times are taken as the
most probable case obtained from prior trace file statistics (if available and applicable).
Otherwise a unitary iteration is considered.

2) All processes that are dependent on such solitary loop processes are scheduled with a
small buffer at their start times. This provides time for notification through the
communication channels of any deviation from assumption 1 at run time.
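
The notification step can be pictured as a traversal of the dependency relation. The Python
sketch below uses the dependencies listed above; the start times, the size of the overrun and
the uniform shift applied to every notified entry are assumptions, and the real NSM and LSM
tables are not modeled.

```python
from collections import deque

def dependents_to_notify(depends_on, late):
    """All direct and transitive dependents of the process `late`."""
    children = {}
    for proc, parents in depends_on.items():
        for parent in parents:
            children.setdefault(parent, []).append(proc)
    notified, queue = set(), deque([late])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in notified:
                notified.add(child)
                queue.append(child)
    return notified

def delay_entries(start_times, depends_on, late, overrun):
    """Shift the table entries of every notified process by the observed overrun."""
    notified = dependents_to_notify(depends_on, late)
    return {p: t + overrun if p in notified else t for p, t in start_times.items()}

# Dependencies as listed for Figure 18; start times are hypothetical.
depends_on = {"P3": ["P1", "P4"], "P2": ["P1"], "P6": ["P2", "P5"]}
start_times = {"P1": 0, "P4": 0, "P5": 0, "P2": 12, "P3": 12, "P6": 20}

notified = dependents_to_notify(depends_on, "P1")
updated = delay_entries(start_times, depends_on, "P1", overrun=5)
```

If P1 overruns, P2 and P3 are notified directly and P6 through P2, while P4 and P5 keep their
entries.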

If assumption 1 goes wrong, the penalty paid is as follows.

Consider the example in Figure 15, where 2 processes in the ready list are being scheduled
based on PCP. By the PCP method, if $\lambda_A > \lambda_B$ and P1 and P2 do not share the
same resource, then $P_A$ is scheduled earlier than $P_B$. We have assumed that $\lambda_A$
is due to the most probable execution time of loop P1. But if at run time loop P1 executes
fewer times than predicted, so that $\lambda_A < \lambda_B$, then scheduling $P_A$ earlier
than $P_B$ turns out to be a mistake.

We calculate the time difference between both possible schedules. We do not at this point
propose to repair the schedule, because all processes before P1 have already been executed,
and trying to fit another schedule at run time requires intelligence in the communication
network, which is a burden. On the brighter side, if at run time loop P1 executes more
times than predicted, then $\lambda_A$ will still be greater than $\lambda_B$, and the
assumed schedule holds true.



Figure 19: Dynamic Entry Updates in the NSM and LSMs 

Case 2: A combination of two loops with one loop feeding data to the other in an iterative
manner.

Solution: Consider PA feeding data to PB in such a manner. For static scheduling, if
we unroll the loops and treat them as a series of smaller individual processes, it is not
possible to assume an unpredictable number of iterations: if an unpredictable number
of iterations were assumed in both loops, the memory footprint could become a serious
issue. But an exception can be made. If both loops always run for the same number of
iterations, then the schedule table must initially assume either the most probable number of
iterations or one iteration each, and schedule PA, PB, PA, PB and so on in a particular
column. In case the prediction is exceeded or fallen short of, the NSM and LSMs
must do 2 tasks:

1) If the iterations exceed expectations, then all further dependent processes (data and
resource) must be notified of the postponement and notified for scheduling upon
completion of the iterations, with the appropriate difference between the expected schedule
times and those obtained at run time. If the iterations fall short of expectations, then all
further schedules need only be brought forward.

2) Since the processes PA and PB should denote a single iteration in the table, their entries
should be continuously incremented at run time by the NSM and the LSMs. The
increment for one process happens a predetermined number of times,
triggered by the scheduling or execution of the other process. For example, in Figure
19 we see that PA = 10 cycles and PB = 20 cycles, and hence if both loops run 5 times,
the entry in the column increments as shown.

Only in such a situation can there be preparedness for unpredictable loop iteration counts.
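
The incrementing of the entries can be illustrated with a short Python sketch. It assumes
strict alternation, that is, iteration k of PB starts when iteration k of PA ends and the
next PA iteration waits for PB; the actual update pattern is the one given in Figure 19, and
the function name is hypothetical.

```python
def unrolled_entries(t_a, t_b, iterations, start=0):
    """Start time of each PA / PB iteration under strict alternation.

    Returns (entries, completion_time), with entries as (process, iteration, start).
    """
    entries, clock = [], start
    for k in range(1, iterations + 1):
        entries.append(("PA", k, clock))
        clock += t_a
        entries.append(("PB", k, clock))
        clock += t_b
    return entries, clock

entries, completion = unrolled_entries(t_a=10, t_b=20, iterations=5)
```

Under this assumption, each completed pair of iterations advances both entries by 30 cycles,
and five iterations complete after 150 cycles.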
Case 3: A loop at the macro level, i.e. containing more than a single process.
Solution: In this case, there are some control nodes inside a loop. Hence the execution time
of the loop changes with each iteration. This is a much more complicated case than the
previous ones. Consider a situation where a loop covers 2 mutually
exclusive paths, each path consisting of 2 processes (A,B and C,D) with (3,7 and 15,5) cycle
times. In the schedule table there will be a column to indicate an entry into the loop and 2
columns to indicate the paths inside the loop. Optimality in scheduling inside the loop can
be achieved, but in the global scheme of scheduling, the solution is non-optimal. This
cannot be helped, because to obtain a globally optimal solution, all possible paths have to be
unrolled and statically scheduled. This results in a table explosion and is not feasible,
since a table cannot hold an unbounded number of entries. Hence, from a global
viewpoint, the loop and all its entries are considered as one entity, with the most probable
number of iterations assumed and the most expensive path assumed to be taken in each
iteration. For example, in the above case, path C,D is assumed to be taken all the time.
Now, a schedule is prepared for each path and entered into the table under 2
columns. While one schedule is being executed, the entries for both columns in the
next loop iteration are predicted by adding the completion time of the current path to both
column entries (while doing this, care should be taken not to overwrite the entries
of the current path while they are still being used). Then, when the current iteration is
completed and a fresh one is started, the path is resolved and the appropriate (updated /
predicted) table column is chosen to be loaded from the NSM to the LSMs.
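
As a rough illustration of Case 3, the following C sketch keeps one predicted start time per
path column and shifts both by the completion time of whichever path was actually taken. It
assumes the two processes of a path run back to back, so that path A,B costs 3 + 7 cycles and
path C,D costs 15 + 5 cycles; LoopColumns, worst_case_estimate and complete_iteration are
hypothetical names, not part of the proposed tool set.

```c
#include <assert.h>

#define NUM_PATHS 2

/* Hypothetical table fragment for a loop with mutually exclusive paths. */
typedef struct {
    int path_cycles[NUM_PATHS]; /* total cycles of each path (assumed serial) */
    int start[NUM_PATHS];       /* predicted start of each column, next iteration */
} LoopColumns;

/* Global estimate: the most probable iteration count times the most expensive path. */
int worst_case_estimate(const LoopColumns *lc, int probable_iters)
{
    int worst, p;
    assert(lc != 0 && probable_iters >= 0);
    worst = lc->path_cycles[0];
    for (p = 1; p < NUM_PATHS; ++p)
        if (lc->path_cycles[p] > worst)
            worst = lc->path_cycles[p];
    return probable_iters * worst;
}

/* Run-time update: add the completion time of the path just taken to the
 * entries of both columns for the next iteration. */
void complete_iteration(LoopColumns *lc, int path_taken)
{
    int p, elapsed;
    assert(lc != 0 && path_taken >= 0 && path_taken < NUM_PATHS);
    elapsed = lc->path_cycles[path_taken];
    for (p = 0; p < NUM_PATHS; ++p)
        lc->start[p] += elapsed;
}
```

The sketch deliberately omits the safeguard against overwriting entries still in use, which
a real table update would need.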

4.4 Network architecture 

In order to coordinate the mapping of portions of the schedule table onto corresponding CLUs, 
we propose the following architecture. In Figure 20, the interfacing of the Reconfigurable unit 
with the host processor and other I/O and memory modules is shown. 
LSM = Logic Schedule Manager; NSM = Network Schedule Manager

Figure 20: Overview of the System Architecture 



The Network Schedule Manager (Figure 21) has access to a set of tables, one for each
processor. A table consists of possible tentative schedules for processes or tasks that must be
mapped onto the corresponding processor, subject to the evaluation of certain conditional control
variables. The Logic Schedule Manager schedules and loads the configurations for the
processes that need to be scheduled on the corresponding processor, i.e. all processes that
fall in the same column (a particular condition) of the schedule table. In PCP scheduling,
since the scheduling of the processes in the ready list depends only on the part of the paths
following those processes, the execution time of a process initially includes, for convenience,
the configuration time.

Once a particular process is scheduled and hence removed from the ready list, another
process is chosen to be scheduled, again based on the PCP criteria. But this time the execution
time of that process is reduced by using the reconfiguration time instead of
the configuration time. Essentially, for the first process that is scheduled in a column,

completion time = execution time + configuration time

For the next or successive processes,

completion time = predecessor's completion time + execution time + reconfiguration time
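
The two expressions above can be restated as a short C sketch. The function name
column_completion_times and its parameters are assumptions made for illustration; the sketch
also assumes a single configuration time and a single reconfiguration time for the whole
column, which the text does not state.

```c
#include <assert.h>
#include <stddef.h>

/* Fill completion[i] for the n processes scheduled, in order, in one column.
 * The first process pays the full configuration time; each successor pays
 * only the reconfiguration time on top of its predecessor's completion time. */
void column_completion_times(const int *exec_time, size_t n,
                             int config_time, int reconfig_time,
                             int *completion)
{
    size_t i;
    assert(exec_time != NULL && completion != NULL);
    for (i = 0; i < n; ++i) {
        if (i == 0)
            completion[i] = exec_time[i] + config_time;
        else
            completion[i] = completion[i - 1] + exec_time[i] + reconfig_time;
    }
}
```

For execution times of 10, 20 and 5 cycles with a configuration time of 8 and a
reconfiguration time of 2, the completion times under these assumptions are 18, 40 and 47.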
CLU = Configurable Logic Unit; LU = Logic Units; SN = Switching Network
CM = Configuration Memory; LSM = Logic Schedule Manager 



Figure 21 : The Internals of the Reconfigurable Unit 

Assuming that, once a configuration has been loaded into the CM, putting the configuration
in place is instantaneous, it is always advantageous to load successive
configurations into the CM ahead of time. This hides the latency of loading
a successive configuration.

The reconfiguration time is dependent on two factors:

1) How much configuration data needs to be loaded into the CM (application dependent)

2) How many wires are available to carry this information from the LSM to the CM
(architecture dependent)
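
A back-of-the-envelope model of these two factors is sketched below in C. It assumes that
each wire carries one configuration bit per cycle and that a prefetched load can overlap with
the execution of the preceding process; both assumptions, and the names config_load_cycles
and exposed_load_cycles, are illustrative only.

```c
#include <assert.h>

/* Cycles needed to move config_bits over `wires` parallel wires, assuming
 * one bit per wire per cycle (ceiling division). */
long config_load_cycles(long config_bits, long wires)
{
    assert(config_bits >= 0 && wires > 0);
    return (config_bits + wires - 1) / wires;
}

/* Load latency that remains visible when the load is started ahead of time
 * and overlaps with overlap_cycles of useful execution. */
long exposed_load_cycles(long config_bits, long wires, long overlap_cycles)
{
    long load = config_load_cycles(config_bits, wires);
    assert(overlap_cycles >= 0);
    return load > overlap_cycles ? load - overlap_cycles : 0;
}
```

Under this model, 1000 configuration bits over 32 wires take 32 cycles, of which only 12
remain exposed if the previous process still has 20 cycles left to run.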

The Network Schedule Manager should accept control parameters from all LSMs. It should
have a set of address decoders, because to send the configuration bits to the network fabric,
which consists of a variety of switch boxes, it needs to identify their location. Therefore, for every
column in the table, the NSM needs to know the route a priori. We must NOT try to find a
shortest path at run time. For a given set of communicating processors, there should be a
fixed route. If this is not done, then the communication times of the edges in the CDFG cannot
be used as constants while scheduling the graph.
For any edge,

communication time = a constant and uniform configuration time + data transaction time.

The network architecture consists of switch boxes and interconnection wires. The
architecture will be based on the architecture described in [1]. This will be modeled as a
combination of "Behavioral" and "Structural" style VHDL. The modifications that will be made
are:

a. The Processing Elements derived in section 3 will be used instead of the 4-input LUTs
that were used in Andre's model.

b. RAM style address access will be used to select a module or a switch box on the circuit.

c. Switch connections that are determined to be fixed for an application will be configured
only once (at the start of that application).

d. Switch connections that are determined to be fixed for all applications will be shorted, and
the RC model for that particular connection will be ignored in
power consumption calculations.

e. The number of hierarchy levels will be determined by the application that has the maximum
number of modules, because there is a fixed number of modules that can be connected.

There will be 1 Network Schedule Manager (NSM), modeled in "Behavioral" and "Structural"
style VHDL, which will store the static schedule table for the currently running application. The
NSM collects the evaluated Boolean values of all conditional variables from every module.
For placing modules on the network, 2 simple criteria are used. These are based on the
assumption that the network consists of Groups of 4 Processing Unit Slots (G4PUS)
connected in a hierarchical manner.
Note: A loop could include 0 or more CGPEs.

Therefore the following priority will be used for mapping modules onto the G4PUS:

a. A collection of 1 to 4 modules which are encompassed inside a loop shall be mapped to a
G4PUS.

i. If there are more than 4 modules inside a loop, then the next batch of 4 modules is
mapped to the next (neighboring) G4PUS.

ii. If the number of CGPEs in a loop is ≥ 2, then they will have greater priority than any FGPEs in
that loop for a slot in the G4PUS.

b. For all other modules:

iii. CGPE modules with more than 1 fan-in from other CGPEs will be mapped into a
G4PUS.

iv. CGPE modules with more than 1 fan-in from other FGPEs will be mapped into a
G4PUS.

Note: The priorities are based on the amount of communication between
modules. Both fan-ins and fan-outs could be considered; for simplicity, we choose fan-ins to
CGPEs only.
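
Rule (a) of the mapping priority can be sketched in C as follows. The sketch covers only the
modules of a single loop, reads the CGPE threshold as "at least 2", and fills slots in
batches of 4 into consecutive G4PUS. The Module type, the place helper and the function
map_loop_modules are hypothetical names; rules (iii) and (iv) are not modeled.

```c
#include <assert.h>
#include <stddef.h>

#define SLOTS_PER_GROUP 4

typedef enum { FGPE, CGPE } ModuleKind;

typedef struct {
    int id;
    ModuleKind kind;
    int group;   /* index of the assigned G4PUS */
    int slot;    /* slot 0..3 inside that G4PUS */
} Module;

/* Assign the module to the position-th free slot, counting from first_group. */
static void place(Module *m, size_t position, int first_group)
{
    m->group = first_group + (int)(position / SLOTS_PER_GROUP);
    m->slot = (int)(position % SLOTS_PER_GROUP);
}

/* Map the n modules of one loop; returns the number of G4PUS groups used. */
int map_loop_modules(Module *mods, size_t n, int first_group)
{
    size_t i, cgpes = 0, position = 0;
    assert(mods != NULL || n == 0);
    for (i = 0; i < n; ++i)
        if (mods[i].kind == CGPE)
            ++cgpes;

    if (cgpes >= 2) {
        /* CGPEs take priority for slots, FGPEs fill what remains. */
        for (i = 0; i < n; ++i)
            if (mods[i].kind == CGPE)
                place(&mods[i], position++, first_group);
        for (i = 0; i < n; ++i)
            if (mods[i].kind == FGPE)
                place(&mods[i], position++, first_group);
    } else {
        for (i = 0; i < n; ++i)
            place(&mods[i], position++, first_group);
    }
    return (int)((n + SLOTS_PER_GROUP - 1) / SLOTS_PER_GROUP);
}
```

A loop with six modules, two of them CGPEs, would occupy two neighboring G4PUS, with the two
CGPEs in the first two slots of the first group.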

5 Testing Methodology 

In this research effort, we will focus mainly on reducing the number of reconfigurations that need
to be made for running an application and then running other applications on the same processor.
We also aim to reduce the time required to load these configurations from memory, measured in terms of
the number of configuration bits corresponding to the number of switches.
The time to execute an application for a given area and a given clock frequency can be
measured by simulation in VHDL. (Area estimate models of Xilinx FPGAs and
hierarchical architectures can be used only for the routing portion of the circuit.)

The time taken to swap clusters within an application and to swap applications (reconfigure the
circuit from implementing one application to another) depends on the similarity between the
successor and predecessor circuits. We will measure the time to make a swap in terms of the number of
bits required for loading a new configuration. RAM style loading of configuration bits
will be used, since it has been shown [2] to be faster than serial loading (used in Xilinx FPGAs). We expect
a further speed-up over plain RAM style loading for 2 reasons:

a) The address decoder can only access one switch box at a time. So the greater the
granularity of the modules, the fewer the number of switches used and hence configured.

b) Compared to peer architectures which have only LUTs or a mixture of LUTs and CGPEs
with low granularity (MAC units), we expect to have CGPEs of moderate granularity for
abstract control-data flow structures in addition to FGPEs. Since these CGPEs are
derived from the target applications, we expect their granularity to be the best possible
choice for a reconfigurable purpose. They are modeled in "Behavioral" VHDL and are
targeted to be implemented as ASICs. This inherently would lead to a reduced number of
configurations.

The time taken to execute each application individually will be compared to available estimates
obtained for matching area and clock specifications from work carried out by other researchers.
This will be in terms of number of configurations per application, number of bits per
configuration, number of configurations for a given set of applications and hence time in seconds
for loading a set of configurations.

Note on power consumption: Sources of power consumption for a given application can be
classified into 4 parts:

a. Network power consumption due to configurations within an application. This is due to the
effective load capacitance on a wire for a given data transfer from one module to
another for a particular configuration of switches.

Note: The more closed switches a signal has to pass through, the greater the
effective load capacitance and resistance. Shorted switches will not be considered to
contribute to this power.

b. Data transfer into and out of the processor.

Note: This can have a significant impact on the total power in media rich or
communication dominated applications ported onto any processing platform.

c. Processing of data inside a module.

Note: This will require synthesizable VHDL modules. But since our focus in this research
work is on reducing power due to reconfiguration, we will leave this for future work.

d. The clock distribution of the processor.

Note: This can be measured if all parts of the circuit are synthesizable. But we are
focusing on a modeling aspect and do not consider this measurement.

At the level of modeling a circuit in VHDL, it is possible to determine the power consumption
only approximately. We will use the RC models of Xilinx FPGAs and of the architecture in [1] to get
approximate power estimates. Power aware scheduling and routing architecture design are
complex areas of research in themselves and are not the focus of this research effort. In this thesis
we focus on reducing the number of reconfigurations, which directly impacts the speed of the
processor and indirectly impacts the power consumption to a certain extent.

We will compare the performance of our processor for each of the applications with available
estimates (those published in literature) in terms of time (execution time for an application at a
given clock frequency).

We cannot compare the area parameter because the architecture being used is Rent's Hierarchical
Model, which was proposed in [1]. An optimized architecture for a given set of applications is
beyond the scope of this thesis. Although the modules are customized based on the clustering
approach, they are not in a synthesized form, and hence their power consumption and area
occupied are not determinable with the existing EDA tools.
6 Conclusions & Ongoing Research

After a detailed analysis of the current approaches towards designing a dynamic and fast 
reconfigurable processor, we have proposed a methodology consisting of an automated approach 
towards identifying reconfigurable regions in applications. We have also proposed a criterion for 
selection of the number of processing units of a given type to extract the maximum amount of 
resource utilization. Thereafter a scheduling strategy has been discussed that maps the tasks onto 
the computing resources on the processor. We have proposed to select a collection of algorithms 
from areas of media processing and computer graphics to test the methodology. Currently work 
is in progress in refining the individual algorithms for graph matching and scheduling. With the 
completion of the set of tools proposed, an automated approach towards developing dynamically 
reconfigurable processors will be available. 
8 Appendix A

A Control Data Flow Graph consists of both data flow and control flow portions. In compiler
terminology, all regions in a code that lie between branch points are referred to as basic
blocks. Basic blocks that contain additional code due to code movement shall be referred
to as zones. Also, under certain conditions, decision-making control points can be
integrated into the basic block regions. These blocks should be explored for any type of data
level parallelism they have to offer. Therefore, for simplicity in the following description, basic
blocks are referred to as zones. The methodology remains the same when modified basic blocks
and abstract structures such as nested loops and hammock structures are considered as zones.

High level ANSI C code of the target application is first converted to assembly code
(UltraSPARC). Since the programming style is user dependent, the assembly code needs to be
expanded in terms of all function calls. To handle the expanded code, a suitable data structure
that has a low memory footprint is utilized. Assembly instructions that act as delimiters to zones
must then be identified. The data structure is then modified into a more convenient
form for extracting zone level parallelism.

The following are the steps involved in extracting zone level parallelism.

Step-1: Parsing the assembly files

In this step, for each assembly (.s) file a doubly linked list is created, where each node stores
one instruction with its operands and has pointers to the previous and next
instructions in the assembly code. The parser ignores all commented-out lines and lines without
instructions, except labels such as

main:

.LL3:

Each label starting with .LL is replaced with a unique number (unique over all functions).
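
A minimal C sketch of the list built in Step-1 follows. It assumes that comment lines begin
with '!' (the SPARC assembler convention seen in the listing later in this appendix) and that
a line fits in 63 characters; the names InstrNode, InstrList, append_line and free_list are
hypothetical, and the renumbering of .LL labels is left out for brevity.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One node per retained assembly line (instruction or label). */
typedef struct InstrNode {
    char text[64];
    struct InstrNode *prev, *next;
} InstrNode;

typedef struct {
    InstrNode *head, *tail;
    size_t count;
} InstrList;

/* Skip leading blanks and tabs. */
static const char *skip_blanks(const char *s)
{
    while (*s == ' ' || *s == '\t')
        ++s;
    return s;
}

/* Append one source line. Returns 1 if stored, 0 if ignored (empty line or
 * '!' comment), -1 on allocation failure. */
int append_line(InstrList *list, const char *line)
{
    InstrNode *n;
    assert(list != NULL && line != NULL);
    line = skip_blanks(line);
    if (*line == '\0' || *line == '\n' || *line == '!')
        return 0;
    n = calloc(1, sizeof *n);
    if (n == NULL)
        return -1;
    strncpy(n->text, line, sizeof n->text - 1);
    n->text[strcspn(n->text, "\n")] = '\0';   /* drop the trailing newline */
    n->prev = list->tail;
    if (list->tail != NULL)
        list->tail->next = n;
    else
        list->head = n;
    list->tail = n;
    list->count++;
    return 1;
}

/* Release every node and reset the list. */
void free_list(InstrList *list)
{
    InstrNode *n = list->head;
    while (n != NULL) {
        InstrNode *next = n->next;
        free(n);
        n = next;
    }
    list->head = list->tail = NULL;
    list->count = 0;
}
```

Each retained line becomes one node, so later steps can walk the list in either direction
through the prev and next pointers.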
Step-2: Expansion

Each assembly file that has been parsed is stored in a separate linked list. In this step the
expander moves through the nodes of the linked list that stores main.s. If a function call is
detected, that function is searched for through all linked lists. When it is found, that function,
from beginning to end, is copied and inserted at the place where it is called. Then
the expander continues moving through the nodes from where it stopped. Expansion
continues until the end of main.s is reached. Note that if an inserted function itself calls
some other function, the expander also expands it, until every called function is inserted in the
right place.

In the sample code (Appendix B), the main() function calls the findsum() function twice
and the findsum() function calls the findsub() function. The expanded code (after
considering the individual assembly codes in Appendix C) is shown in Appendix D.

Step-3: Create Control Flow Linked List

Once the main.s function has been expanded and stored in a doubly linked list, the next step
is to create another doubly linked list (control_flow_linked_list) that stores the control flow
information. This will be used to analyze the control flow structure of the application code and to
detect the starting and ending points of functions and control structures (loops, if..else
statements, etc.).

As the expanded linked list is scanned, each node is checked to see whether it is a:

• label, or

• function, or

• conditional branch, or

• unconditional branch

in which case a new node is created and appended to the control flow linked list, with
its member pointers set as defined below.

If the current node is a

• function label

A pointer to the expanded list pointing to the function label node

A pointer to the expanded list pointing to the beginning of the function (the next node of the
function label node)

A pointer to the expanded list pointing to the end of the function

And node type is set to "function".

• label

A pointer to the expanded list pointing to the label node

A pointer to the expanded list pointing to the beginning of the label (the next node of the
label node)

And node type is set to "square".

• unconditional branch (b)

A pointer to the expanded list pointing to the branch node

A pointer to the control flow linked list pointing to the node that stores the matching target
label of the branch instruction.

And node type is set to "dot".

• conditional branch (bne, ble, bge, etc.)

A pointer to the expanded list pointing to the branch node

A pointer to the control flow linked list pointing to the node that stores the matching target
label of the branch instruction.

And node type is set to "circle".

The control flow linked list output for the findsum.s function is shown in Appendix D.
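
The node typing of Step-3 can be sketched in C as a classifier over the text of one expanded
list node. The sketch assumes that lines are already trimmed, that compiler-generated labels
start with '.' while function labels do not, and that only the six conditional branch
mnemonics listed in the code occur; NodeType and classify are hypothetical names.

```c
#include <assert.h>
#include <string.h>

typedef enum {
    NODE_NONE,      /* ordinary instruction: no control flow node is created */
    NODE_FUNCTION,  /* function label, e.g. "main:" */
    NODE_SQUARE,    /* ordinary label, e.g. ".LL3:" */
    NODE_DOT,       /* unconditional branch "b" */
    NODE_CIRCLE     /* conditional branch, e.g. "ble" */
} NodeType;

/* Decide which kind of control flow node, if any, a line gives rise to. */
NodeType classify(const char *text)
{
    static const char *const conditional[] = {
        "bne", "be", "ble", "bl", "bge", "bg"   /* assumed subset */
    };
    char mnemonic[8] = { 0 };
    size_t len, i;

    assert(text != NULL);
    len = strlen(text);
    if (len == 0)
        return NODE_NONE;
    if (text[len - 1] == ':')
        return (text[0] == '.') ? NODE_SQUARE : NODE_FUNCTION;

    /* Copy the mnemonic: everything up to the first blank or tab. */
    for (i = 0; i < sizeof mnemonic - 1 && text[i] != '\0'
                && text[i] != ' ' && text[i] != '\t'; ++i)
        mnemonic[i] = text[i];

    if (strcmp(mnemonic, "b") == 0)
        return NODE_DOT;
    for (i = 0; i < sizeof conditional / sizeof conditional[0]; ++i)
        if (strcmp(mnemonic, conditional[i]) == 0)
            return NODE_CIRCLE;
    return NODE_NONE;
}
```

Setting the member pointers of the new control flow node (to the expanded list and to the
matching target label) would follow this classification and is not shown.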
Step 4: Modification of Control Structure

The control structure linked list (which essentially represents the control flow graph of the
candidate algorithm) is then modified as follows.

• The pointers from unconditional branch nodes (also called "dot" nodes) to the next node
in the list need to be disconnected and made NULL. Hence for the "dot" node:

node->next = NULL

for the following node:

node->previous = NULL

{Exception: if the next node of the "dot" node is itself the target node!}

• The target nodes of the unconditional branches need to be marked as "Possible Exit"
nodes. These "Exit" classes of nodes are a subset of the regular "Target" or "Square"
nodes.

• If the unconditional branch node's rank is higher than the target node's rank (indicating a
feedback or loop), disconnect the link and mark it as NULL.
Hence for the "dot" node:

node->to_target = NULL

But before disconnecting, mark target->next (which should be a circle) as a "loop
node".

• In a special case, if an unconditional branch and a square share the same node, then the
target of that unconditional branch is declared as an exit square with a loop type (because the
instructions following this square comprise the body of the do-while loop). This exit
square will not have its next pointer pointing to a circle. The circle is accessed through the
dot node using the previous pointer. Then it is marked as type loop.

• If a "Possible Exit" node has 2 valid input pointers, and the rank of both source pointers is
lower than that of the node under consideration, then it is an "Exit" node; disconnect the link to
the corresponding "dot" node, and hence also set that "dot" node's target pointer to
NULL. In other words, if the node->previous pointer of the "square/target" node of the
"dot" node does not point to the "dot" node, then it has 2 valid pointers.
Hence for the "dot" node:

node->to_target = NULL
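
The first three rules of Step 4 can be sketched in C for a single "dot" node. The CfNode
layout, the rank field (taken here as the position of the node in the list) and the function
process_dot are assumptions for illustration; the do-while special case and the final "Exit"
rule are not covered.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { CF_SQUARE, CF_DOT, CF_CIRCLE } CfType;

typedef struct CfNode {
    CfType type;
    int rank;            /* assumed: position in the control flow list */
    int is_loop;         /* set on the circle that heads a loop */
    int possible_exit;   /* set on targets of unconditional branches */
    struct CfNode *next, *previous, *to_target;
} CfNode;

/* Apply the rules to one dot node. Returns 1 if a feedback (loop) edge was
 * found and removed, 0 otherwise. */
int process_dot(CfNode *dot)
{
    CfNode *target;
    assert(dot != NULL && dot->type == CF_DOT);
    target = dot->to_target;

    /* Cut the fall-through link, unless the next node is itself the target. */
    if (dot->next != NULL && dot->next != target) {
        dot->next->previous = NULL;
        dot->next = NULL;
    }
    if (target == NULL)
        return 0;

    target->possible_exit = 1;

    /* A branch to a lower-ranked node is a feedback edge. */
    if (dot->rank > target->rank) {
        if (target->next != NULL && target->next->type == CF_CIRCLE)
            target->next->is_loop = 1;   /* mark the loop node first */
        dot->to_target = NULL;
        return 1;
    }
    return 0;
}
```

Applied to a node such as "dot 3" in the listing that follows, whose target "square 3" is
directly followed by "circle 6", the sketch would mark that circle as a loop node and then
drop the feedback link.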

A sample high level code is shown in Figure 1 below, followed by the expanded
assembly file. The control flow linked list is shown in Figure 2. After modifications to
this linked list, the structure indicated in Figure 3 is obtained.



#include<stdio.h>
void main()
{
    int i=0,j=0,k=0,l=0,m=0,n=0,p=0,r=0;

    for(i=1;i<10;i++)
    {
        p = p-8;
        p = p*7;
    }
    i = i+1;
    if(i==j)
    {
        n = 9;
        if(k>0)
        {
            p = 19;
        }
        else
        {
            r = 23;
        }
        n = 17 + 8;
    }
    else
    {
        l = 10;
        m = n+r;
    }
    k = k-14;
    k = 7-8*p;
    while(i<p)
    {
        p = p*20;
        p = p-7;
        while(k==8)
        {
            p = p+17;
            i = i*p;
        }
        p = p - 23;
    }
    m = m+5;
    n = n+4;
}

Figure 1 : An Example Program
The gcc (version 2.95.2) compiled code for the UltraSPARC architecture with node labeling
is as follows:

        .file   "loop_pattern4.c"
gcc2_compiled.:
        .global .umul
.section        ".text"
        .align 4
        .global main
        .type   main,#function
        .proc   020
main:                                           ground
        !#PROLOGUE# 0
        save    %sp, -144, %sp
        !#PROLOGUE# 1
        st      %g0, [%fp-20]
        st      %g0, [%fp-24]
        st      %g0, [%fp-28]
        st      %g0, [%fp-32]
        st      %g0, [%fp-36]
        st      %g0, [%fp-40]
        st      %g0, [%fp-44]
        st      %g0, [%fp-48]
        mov     1, %o0
        st      %o0, [%fp-20]
.LL3:                                           square 3
        ld      [%fp-20], %o0
        cmp     %o0, 9
        ble     .LL6                            circle 6
        nop
        b       .LL4                            dot 4
        nop
.LL6:                                           square 6
        ld      [%fp-44], %o0
        add     %o0, -8, %o1
        st      %o1, [%fp-44]
        ld      [%fp-44], %o0
        mov     %o0, %o1
        sll     %o1, 3, %o2
        sub     %o2, %o0, %o0
        st      %o0, [%fp-44]
.LL5:                                           square 5
        ld      [%fp-20], %o0
        add     %o0, 1, %o1
        st      %o1, [%fp-20]
        b       .LL3                            dot 3
        nop
.LL4:                                           square 4
        ld      [%fp-20], %o0
        add     %o0, 1, %o1
        st      %o1, [%fp-20]
        ld      [%fp-20], %o0
        ld      [%fp-24], %o1
        cmp     %o0, %o1
        bne     .LL7                            circle 7
        nop
        mov     9, %o0
        st      %o0, [%fp-40]
        ld      [%fp-28], %o0
        cmp     %o0, 0
        ble     .LL8                            circle 8
        nop
        mov     19, %o0
        st      %o0, [%fp-44]
        b       .LL9                            dot 9
        nop
.LL8:                                           square 8
        mov     23, %o0
        st      %o0, [%fp-48]
.LL9:                                           square 9
        mov     25, %o0
        st      %o0, [%fp-40]
        b       .LL10                           dot 10
        nop
.LL7:                                           square 7
        mov     10, %o0
        st      %o0, [%fp-32]
        ld      [%fp-40], %o0
        ld      [%fp-48], %o1
        add     %o0, %o1, %o0
        st      %o0, [%fp-36]
.LL10:                                          square 10
        ld      [%fp-28], %o0
        add     %o0, -14, %o1
        st      %o1, [%fp-28]
        ld      [%fp-44], %o0
        mov     %o0, %o1
        sll     %o1, 3, %o0
        mov     7, %o1
        sub     %o1, %o0, %o0
        st      %o0, [%fp-28]
.LL11:                                          square 11
        ld      [%fp-20], %o0
        ld      [%fp-44], %o1
        cmp     %o0, %o1
        bl      .LL13                           circle 13
        nop
        b       .LL12                           dot 12
        nop
.LL13:                                          square 13
        ld      [%fp-44], %o0
        mov     %o0, %o2
        sll     %o2, 2, %o1
        add     %o1, %o0, %o1
        sll     %o1, 2, %o0
        st      %o0, [%fp-44]
        ld      [%fp-44], %o0
        add     %o0, -7, %o1
        st      %o1, [%fp-44]
.LL14:                                          square 14
        ld      [%fp-28], %o0
        cmp     %o0, 8
        be      .LL16                           circle 16
        nop
        b       .LL15                           dot 15
        nop
.LL16:                                          square 16
        ld      [%fp-44], %o0
        add     %o0, 17, %o1
        st      %o1, [%fp-44]
        ld      [%fp-20], %o0
        ld      [%fp-44], %o1
        call    .umul, 0
        nop
        st      %o0, [%fp-20]
        b       .LL14                           dot 14
        nop
.LL15:                                          square 15
        ld      [%fp-44], %o0
        add     %o0, -23, %o1
        st      %o1, [%fp-44]
        b       .LL11                           dot 11
        nop
.LL12:                                          square 12
        ld      [%fp-36], %o0
        add     %o0, 5, %o1
        st      %o1, [%fp-36]
        ld      [%fp-40], %o0
        add     %o0, 4, %o1
        st      %o1, [%fp-40]
.LL2:                                           square 2
        ret
        restore
.LLfe1:
        .size   main,.LLfe1-main
        .ident  "GCC: (GNU) 2.95.2 19991024 (release)"
Step 5: Creation of Zones

To extract all possibilities of parallelism and reconfiguration, zones are identified in the
modified structure. To identify such sections, delimiters are needed. A delimiter can be
any of the following types of nodes:

(i) Circle

(ii) Dot

(iii) Exit square

(iv) Square

(v) Power

(vi) Ground

A 'Circle' can indicate the start of a new zone or the end of a zone. A 'Dot' can only indicate
the end of a zone or a break in a zone. An 'Exit square' can indicate the start of a new zone
or the end of a zone. A 'Square' can only indicate the continuation of a break in the current
zone. A 'Power' can only indicate the beginning of the first zone. A 'Ground' can only
indicate the end of a zone.

Figure 4 shows example zones to illustrate the use of delimiters. Three zones, 1, 2, and 3, all
share a common node, 'Circle 6'. This node is the end of Zone 1 and the start of Zones 2 and
3. Zone 1 has the 'Power' node as its start, while Zone 6 has the 'Ground' node as its end. The
'Dot 3' in Zone 3 indicates the end of that zone, while 'Dot 4' indicates a break in Zone 2.
This break is continued by 'Square 4'. In Zone 4, 'Square 9' indicates the end of the zone
while it marks the start of Zone 5.

This function identifies zones in the structure in a manner analogous to the numbering system in
the chapter page of a book. Zones can have sibling zones (to identify if/else conditions,
wherein only one of the two possible paths can be taken {Zones 4 and 7 in Figure 1}) or
child zones (to identify nested control structures {Zone 10 being a child of Zone 8 in Figure
1}). Zone types can be either simple or loopy in nature (to identify iterative loop structures).
The tree is scanned node by node, and decisions are taken to start a new zone or end an
existing zone at key points such as circles, dots and exit squares. By default, when a circle is
visited for the first time, the branch taken path is followed. But this node, along with the
newly started zone, is stored in a queue for a later visit along the branch not taken path. When
the structure has been traversed along the "branch taken" paths, the nodes with associated
zones are popped out from the stack and traversed along their "branch not taken" paths. This
is done until all nodes have been scanned and the stack is empty.
The pseudo code for the above process is given below:

Global variables: pop_flag = 0, tree_empty = 0;

Zonise (node) /* input into the function is the current node, a starting node */
{
    while (tree_empty == 0) /* this loop goes on node by node in the tree till all nodes
                               have been scanned */
    {
        if (node->type == circle)
        {
            if (pop_flag != set) /* pop_flag is set when a pop operation is done */
            {
                /* an entry here means that the circle was encountered for the first
                   time */
                /* so set the node->visited flag */
                /* close the zone */
                /* since a fresh circle is being entered, the new zone cannot be created
                   as a sibling to the one just closed */
                /* if the zone just closed has a valid Anchor Point, is of
                   type Loop and has its visited flag set, then a child zone
                   cannot be created */
                /* accordingly create a new zone */
                /* set child as current zone */
                /* push this zone and the node into the queue */
                /* take the taken path for the node, i.e. node = node->taken */
            }

            if (pop_flag == set)
            {
                /* an entry here means that we are visiting a node and its associated
                   zone that have just been popped out from the queue, hence
                   revisiting an old node */
                /* since this node has its visited flag set, change that flag value
                   to -1, so as to avoid any erroneous visit in the future */
                /* if node is of type Non Loop, then spawn a new sibling zone */
                /* if node is of type Loop, then spawn a new zone as later parent zone
                   and mark zone type as loop */
                /* choose the not taken path for the node */
            }
        }
        else if (node->type == exit square)
        {
            /* close the zone */
            /* if the closed zone has a parent, i.e. the zone->parent pointer is not NULL,
               then create a new zone with link to the parent zone as type next zone */
            /* if the closed zone does not have a parent, then spawn a new zone that is
               next to the closed zone */
            /* choose the not taken path for the node */
        }
        else if (node->type == dot and node->taken == NULL)
        {
            /* close zone */
            /* choose node to be considered next by popping out from the queue */
            /* in case the queue is empty, all nodes in tree have been scanned */
            /* set pop_flag */
        }
        else if (node->type == dot and node->taken != NULL)
        {
            /* this is just a break in the current zone */
            /* create temp_stop1 and temp_start1 pointers */
            /* choose node->taken path */
        }
    } /* end of the first while loop */
}
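
The traversal order used by the pseudo code (taken paths first, then the saved circles along
their not taken paths) can be shown with a much reduced C sketch. It only counts the zones
that would be opened: exit squares, loop typing and the sibling, child and parent links are
left out, and the structure is assumed to be a tree, as it is after Step 4. ZNode, ZKind,
count_zones and the fixed stack size are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { Z_SQUARE, Z_CIRCLE, Z_DOT, Z_GROUND } ZKind;

typedef struct ZNode {
    ZKind kind;
    int visited;          /* 0 = new, 1 = taken path followed, -1 = both paths done */
    struct ZNode *taken;  /* branch target of a circle, or remaining link of a dot */
    struct ZNode *next;   /* fall-through successor (the not taken path of a circle) */
} ZNode;

#define STACK_MAX 64

/* Walk the structure and return the number of zones opened. */
int count_zones(ZNode *start)
{
    ZNode *stack[STACK_MAX];
    int top = 0;
    int zones = (start != NULL) ? 1 : 0;   /* the first zone starts at 'start' */
    ZNode *node = start;

    while (node != NULL) {
        ZNode *following = NULL;

        if (node->kind == Z_CIRCLE && node->visited == 0) {
            node->visited = 1;             /* first visit: remember the circle */
            assert(top < STACK_MAX);
            stack[top++] = node;
            ++zones;                       /* close current zone, open taken-path zone */
            following = node->taken;
        } else if (node->kind == Z_DOT && node->taken != NULL) {
            following = node->taken;       /* just a break in the current zone */
        } else if (node->kind == Z_SQUARE) {
            following = node->next;
        }
        /* Dead end (dot without target, ground): resume at a saved circle. */
        while (following == NULL && top > 0) {
            ZNode *circle = stack[--top];
            circle->visited = -1;          /* guard against an erroneous revisit */
            ++zones;                       /* zone for the not taken path */
            following = circle->next;
        }
        node = following;
    }
    return zones;
}
```

For a single if/else shaped structure the sketch reports three zones: the one before the
circle and one for each of the two paths.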

Once the zones have been identified in the structure, certain relationships can be observed
among them. These form the basis for the extraction of parallelism at the level of zones. A zone
inside a control structure is the 'later child' of the zone outside the structure. Hence the zone
outside a control structure and occurring before (in code sequence) the zone inside the control
structure is a 'former parent' of the zone present inside. But the zone outside a control
structure and occurring after (in code sequence) the zone inside the structure is referred to as
the 'later parent'. Similarly, the child in this case would be a 'former child'. A zone occurring
after another zone and not related through a control structure is the 'next' of the earlier one.
After parsing through the structure, the zonal relationship shown in Figure 5 is
obtained.

S: sibling relationship
LC: later child relationship
LP: later parent relationship

In all types, the destination zone is the (lc/s/lp) of the source zone.
The shaded zones are Loop types.



Figure 5: Initial Zone Structure obtained

This is referred to as the 'initial zone structure'. The term initial is used because some links
need to be created and some existing ones need to be removed. This process is explained in
the section below.

Step 6: Further Modification of the 'initial zone structure'

Some of the relationships that were discussed in the previous step cannot exist with the
existing set of links, and others are redundant. For example, in Figure 5, we see that:

Z1 can be connected to Z2 through 'n'

Z12 can be connected to Z13 through 'lp'

Z13 can be connected to Z6 through 'n'

Z8 can be connected to Z9 through 'n'

Z4 can be connected to Z5 through 'lp'

Z5 can be connected to Z13 through 'lp'

Z7 can be connected to Z5 through 'lp'

But Z8's relationship to Z6 through 'lp' is false, because no node can have both 'n' and 'lp' links.
In such a case, the 'lp' link should be removed.

Therefore some rules need to be followed to establish 'n' and 'lp' type links, if they don't
exist.

To form an 'n' link:

If a zone (1) has an 'lc' link to zone (2), and if that zone (2) has an 'lp' link to a zone (3), then
an 'n' link can be established between 1 and 3. This means that if zone (1) is of type 'loop',
then zone (3) will now be classified as type 'loop' also.

To form an 'lp' type link if it doesn't exist:

If a zone (1) has an 'fp' link to zone (2), and if that zone (2) has an 'n' link to a zone (3), then
an 'lp' link can be established between 1 and 3.

If a zone (1) has an 'lp' link to zone (2), and also has an 'n' link to zone (3), then first
remove the 'lp' link to zone (2) from zone (1), and then place an 'lp' link from zone (3) to
zone (2).
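
Two of these link rules are sketched in C below: forming an 'n' link from an 'lc' followed by
an 'lp', and resolving a zone that carries both 'lp' and 'n'. The Zone structure and the
function names form_next_link and resolve_lp_conflict are hypothetical; the 'fp' rule is not
shown, and an existing 'lp' on zone (3) is simply overwritten in this simplified version.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Zone {
    int id;
    int is_loop;        /* zone type: 1 = loop, 0 = simple */
    struct Zone *lc;    /* later child */
    struct Zone *lp;    /* later parent */
    struct Zone *n;     /* next */
} Zone;

/* If z1 -lc-> z2 and z2 -lp-> z3, establish z1 -n-> z3 when z1 has no 'n'
 * link yet. A loop-typed z1 makes z3 loop-typed too. Returns 1 if added. */
int form_next_link(Zone *z1)
{
    assert(z1 != NULL);
    if (z1->n == NULL && z1->lc != NULL && z1->lc->lp != NULL) {
        z1->n = z1->lc->lp;
        if (z1->is_loop)
            z1->n->is_loop = 1;
        return 1;
    }
    return 0;
}

/* If z1 has both an 'lp' link (to z2) and an 'n' link (to z3), remove the
 * 'lp' link from z1 and place it from z3 to z2. Returns 1 if rewired. */
int resolve_lp_conflict(Zone *z1)
{
    assert(z1 != NULL);
    if (z1->lp != NULL && z1->n != NULL) {
        Zone *z2 = z1->lp;
        Zone *z3 = z1->n;
        z1->lp = NULL;
        z3->lp = z2;
        return 1;
    }
    return 0;
}
```

Running both functions over every zone until neither reports a change would be one possible
way to reach a stable set of links under these assumptions.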

This provides the 'comprehensive zone structure' as shown in Figure 6 (with cancelled links)
and in Figure 7 (with all cancelled links removed).




Figure 6: Comprehensive Zone structure with cancelled links shown 




Figure 7: Comprehensive Zone structure with cancelled links removed 
To identify parallelism, and hence the compulsorily sequential paths of execution, the following
approach is adopted. Firstly, the comprehensive zone structure obtained is ordered
sequentially by starting at the first zone and traversing along an 'lc - lp' path. If a sibling link
is encountered, it is given a parallel path. The resulting structure is shown in Figure 8.

Figure 8: Sequentially Ordered Zones

To establish parallelism between a zone (1) of loop count A and its upper zone (2) of loop
count B, where A < B, check for data dependency between zone 1 and all zones above it,
up to and including the zone with the same loop count as zone 2.

In the example above, to establish parallelism between zone 6 and zone 9, check for dependencies
between zone 6 and zones 9, 10 and 8. If there is no dependency, then zone 6 is parallel to zone 8.
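
The A < B check can be sketched in C under some stated assumptions: zones are held in an
array in their sequential order (index 0 at the top), each zone summarizes the variables it
reads and writes as bit masks, and "up to and including the zone with the same loop count"
is read as stopping at the nearest zone above the upper zone that has the same loop count as
the upper zone. ZoneInfo, depends and can_rise are hypothetical names.

```c
#include <assert.h>

typedef struct {
    int loop_count;    /* loop count of the zone */
    unsigned reads;    /* bit i set: variable i is read (assumed encoding) */
    unsigned writes;   /* bit i set: variable i is written */
} ZoneInfo;

/* Any read-after-write, write-after-read or write-after-write overlap. */
static int depends(const ZoneInfo *a, const ZoneInfo *b)
{
    return (a->writes & b->reads) != 0
        || (a->reads & b->writes) != 0
        || (a->writes & b->writes) != 0;
}

/* Can the zone at index 'low' be made parallel with the loop region above it?
 * Scans upward from its upper zone until another zone with the upper zone's
 * loop count has been checked. Returns 1 if no dependency is found. */
int can_rise(const ZoneInfo *zones, int low)
{
    int i, b;
    assert(zones != 0 && low >= 1);
    b = zones[low - 1].loop_count;
    for (i = low - 1; i >= 0; --i) {
        if (depends(&zones[low], &zones[i]))
            return 0;
        if (i != low - 1 && zones[i].loop_count == b)
            break;                 /* reached the bounding zone, inclusive */
    }
    return 1;
}
```

With the zones ordered 8, 10, 9, 6 as in the example, the scan for zone 6 would visit zones
9, 10 and 8 and stop there, leaving any zone above zone 8 unchecked.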
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To establish parallelism between a zone (1) of loop count A and its upper zone (2) of loop 

count B, where A = B, direct dependency check needs to be performed 

To establish parallelism between a zone (1) of loop count A and its upper zone (2) of loop
count B, where A > B, a direct dependency check needs to be performed. Then, zone (1)
will have to have an iteration count of (its own iteration count zone (2)'s iteration
count).

When a zone rises like a bubble and is parallel with another zone in the primary path, and
reaches a dependency, it is placed in a secondary path. No bubble in the secondary path is
subjected to dependency testing.

After a bubble has reached its highest potential and stays put in a place in the secondary
path, the lowest bubble in the primary path is checked for dependency on its upper fellow.
If the upper bubble happens to have a different loop count number, then testing is carried
out as described earlier. In case parallelism cannot be obtained, this bubble is clubbed
with the set of bubbles ranging from its upper fellow up to and including the bubble up the
chain with the same loop count as its upper fellow. A global i/o parameter set is created for
this new coalition. Now this coalition will attempt to find dependencies with its upper fellow.
The loop count for this coalition will be the bounding zone's loop count. Any increase in the
iteration count of this coalition will reflect on all zones inside it. In case a bubble wants to
rise above another one which has a sibling/reverse sibling link, there will be speculative
parallelism.

The algorithm should start at multiple points, one by one. These points can be obtained by 
starting from the top zone and traversing down, till a sibling split is reached. Then this zone 
should be remembered, and one of the paths taken. This procedure is similar to the stack 
saving scheme used earlier in the zonise function. 

Another pre-processing step unrolls every iterative segment of a CDFG that
does not have conditional branch instructions inside it and whose iteration count is known at
compile time.
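As a small illustration of this unrolling step, the sketch below takes the counting loop of findsum from Appendix B (start value 4, ten iterations, no conditional branch in the body) and shows it in rolled and fully unrolled form. The function names count_rolled and count_unrolled are hypothetical and are used only for this comparison; the actual pre-processing operates on the CDFG rather than on C source.

```c
/* Rolled form: the trip count (10) is known at compile time and the
   loop body contains no conditional branch. */
int count_rolled(void)
{
    int i, k = 4;
    for (i = 0; i < 10; i++)
        k = k + 1;
    return k;
}

/* Fully unrolled form: the loop counter, the compare and the backward
   branch disappear, leaving one straight-line segment. */
int count_unrolled(void)
{
    int k = 4;
    k = k + 1;
    k = k + 1;
    k = k + 1;
    k = k + 1;
    k = k + 1;
    k = k + 1;
    k = k + 1;
    k = k + 1;
    k = k + 1;
    k = k + 1;
    return k;
}
```

Both functions return the same value; the unrolled version simply trades code size for the removal of the loop control instructions, which leaves a single zone without a backward branch.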

9 Appendix B 

#include <stdio.h>
void main()
{
    int i,j,k,l;

    i = 10;
    j = 1*4;

    if (j > 5)
    {
        k = findsum(i, j);
        l = 4+k;
    }
    else
    {
        k = findsum(i, j);
        l = k*10;
    }
}

int findsum(int a, int b)
{
    int i,j,k;
    k = 4;
    for(i=0;i<10;i++)
    {
        k = k + 1;
    }
    j = findsub(k, a);
    return j;
}

int findsub(int x, int y)
{
    int t;
    t = x-y;
    return (t);
}

Main.s

        .file   "main.c"
gcc2_compiled.:
.section        ".text"
        .align 4
        .global main
        .type   main,#function
        .proc   020
main:
        !#PROLOGUE# 0
        save    %sp, -128, %sp
        !#PROLOGUE# 1
        mov     10, %o0
        st      %o0, [%fp-20]
        mov     4, %o0
        st      %o0, [%fp-24]
        ld      [%fp-24], %o0
        cmp     %o0, 5
        ble     .LL3
        nop
        ld      [%fp-20], %o0
        ld      [%fp-24], %o1
        call    findsum, 0
        nop
        st      %o0, [%fp-28]
        ld      [%fp-28], %o0
        add     %o0, 4, %o1
        st      %o1, [%fp-32]
        b       .LL4
        nop
.LL3:
        ld      [%fp-20], %o0
        ld      [%fp-24], %o1
        call    findsum, 0
        nop
        st      %o0, [%fp-28]
        ld      [%fp-28], %o0
        mov     %o0, %o2
        sll     %o2, 2, %o1
        add     %o1, %o0, %o1
        sll     %o1, 1, %o0
        st      %o0, [%fp-32]
.LL4:
.LL2:
        ret
        restore
.LLfe1:
        .size   main,.LLfe1-main
        .ident  "GCC: (GNU) 2.95.2 19991024 (release)"



Findsum.s

        .file   "findsum.c"
gcc2_compiled.:
.section        ".text"
        .align 4
        .global findsum
        .type   findsum,#function
        .proc   04
findsum:
        !#PROLOGUE# 0
        save    %sp, -128, %sp
        !#PROLOGUE# 1
        st      %i0, [%fp+68]
        st      %i1, [%fp+72]
        mov     4, %o0
        st      %o0, [%fp-28]
        st      %g0, [%fp-20]
.LL3:
        ld      [%fp-20], %o0
        cmp     %o0, 9
        ble     .LL6
        nop
        b       .LL4
        nop
.LL6:
        ld      [%fp-28], %o0
        add     %o0, 1, %o1
        st      %o1, [%fp-28]
.LL5:
        ld      [%fp-20], %o0
        add     %o0, 1, %o1
        st      %o1, [%fp-20]
        b       .LL3
        nop
.LL4:
        ld      [%fp-28], %o0
        ld      [%fp+68], %o1
        call    findsub, 0
        nop
        st      %o0, [%fp-24]
        ld      [%fp-24], %o0
        mov     %o0, %i0
        b       .LL2
        nop
.LL2:
        ret
        restore
.LLfe1:
        .size   findsum,.LLfe1-findsum
        .ident  "GCC: (GNU) 2.95.2 19991024 (release)"



Findsub.s

        .file   "findsub.c"
gcc2_compiled.:
.section        ".text"
        .align 4
        .global findsub
        .type   findsub,#function
        .proc   04
findsub:
        !#PROLOGUE# 0
        save    %sp, -120, %sp
        !#PROLOGUE# 1
        st      %i0, [%fp+68]
        st      %i1, [%fp+72]
        ld      [%fp+68], %o0
        ld      [%fp+72], %o1
        sub     %o0, %o1, %o0
        st      %o0, [%fp-20]
        ld      [%fp-20], %o0
        mov     %o0, %i0
        b       .LL2
        nop
.LL2:
        ret
        restore
.LLfe1:
        .size   findsub,.LLfe1-findsub
        .ident  "GCC: (GNU) 2.95.2 19991024 (release)"

Expanded 'main' function
Function main BEGINS here
save %sp -128 %sp
mov 10 %o0
st %o0 [%fp-20]
mov 4 %o0
st %o0 [%fp-24]
ld [%fp-24] %o0
cmp %o0 5
ble 0
nop
ld [%fp-20] %o0
ld [%fp-24] %o1
Function findsum BEGINS here
save %sp -128 %sp
st %i0 [%fp+68]
st %i1 [%fp+72]
mov 4 %o0
st %o0 [%fp-28]
st %g0 [%fp-20]
4
ld [%fp-20] %o0
cmp %o0 9
ble 5
nop
b 6
nop
5
ld [%fp-28] %o0
add %o0 1 %o1
st %o1 [%fp-28]
7
ld [%fp-20] %o0
add %o0 1 %o1
st %o1 [%fp-20]
b 4
nop
6
ld [%fp-28] %o0
ld [%fp+68] %o1
Function findsub BEGINS here
save %sp -120 %sp
st %i0 [%fp+68]
st %i1 [%fp+72]
ld [%fp+68] %o0
ld [%fp+72] %o1
sb %o0 %o1 %o0
st %o0 [%fp-20]
ld [%fp-20] %o0
mov %o0 %i0
b 10
nop
10
ret
restore
11
Function findsub ENDS here
findsub .LLfe1-findsub
nop
st %o0 [%fp-24]
ld [%fp-24] %o0
mov %o0 %i0
b 8
nop
8
ret
restore
9
Function findsum ENDS here
findsum .LLfe1-findsum
nop
st %o0 [%fp-28]
ld [%fp-28] %o0
add %o0 4 %o1
st %o1 [%fp-32]
b 1
nop
0
ld [%fp-20] %o0
ld [%fp-24] %o1
Function findsum BEGINS here
save %sp -128 %sp
st %i0 [%fp+68]
st %i1 [%fp+72]
mov 4 %o0
st %o0 [%fp-28]
st %g0 [%fp-20]
4
ld [%fp-20] %o0
cmp %o0 9
ble 5
nop
b 6
nop
5
ld [%fp-28] %o0
add %o0 1 %o1
st %o1 [%fp-28]
7
ld [%fp-20] %o0
add %o0 1 %o1
st %o1 [%fp-20]
b 4
nop
6
ld [%fp-28] %o0
ld [%fp+68] %o1
Function findsub BEGINS here
save %sp -120 %sp
st %i0 [%fp+68]
st %i1 [%fp+72]
ld [%fp+68] %o0
ld [%fp+72] %o1
sb %o0 %o1 %o0
st %o0 [%fp-20]
ld [%fp-20] %o0
mov %o0 %i0
b 10
nop
10
ret
restore
11
Function findsub ENDS here
findsub .LLfe1-findsub
nop
st %o0 [%fp-24]
ld [%fp-24] %o0
mov %o0 %i0
b 8
nop
8
ret
restore
9
Function findsum ENDS here
findsum .LLfe1-findsum
nop
st %o0 [%fp-28]
ld [%fp-28] %o0
mov %o0 %o2
sll %o2 2 %o1
add %o1 %o0 %o1
sll %o1 1 %o0
st %o0 [%fp-32]
1
2
ret
restore
3
Function main ENDS here

Control flow linked list



Figure: Main linked list (three nodes, each with the fields to_main, to_target, begins and ends, drawn against an excerpt of the expanded listing covering the findsum loop at labels 4, 5 and 6 and the body of findsub)






13 Appendix F 

In this section the pseudo ANSI C codes for the test-bench algorithms are presented.

Note: For an in-depth analysis and explanation of all graphics algorithms, please refer to the book
"Computer Graphics: Principles and Practice", Second Edition in C, by Foley, van Dam, Feiner and
Hughes.

Cohen-Sutherland Line Clipping

typedef unsigned int outcode;

enum {TOP = 0x1, BOTTOM = 0x2, RIGHT = 0x4, LEFT = 0x8};

void CohenSutherlandLineClipAndDraw (
    double x0, double y0, double x1, double y1, double xmin, double xmax,
    double ymin, double ymax, int value)
/* Cohen-Sutherland clipping algorithm for line P0 = (x0,y0) to P1 = (x1,y1) and */
/* clip rectangle with diagonal from (xmin,ymin) to (xmax,ymax) */
{
    /* Outcodes for P0, P1 and whatever point lies outside the clip rectangle */
    outcode outcode0, outcode1, outcodeOut;
    boolean accept = FALSE, done = FALSE;

    outcode0 = CompOutCode (x0, y0, xmin, xmax, ymin, ymax);
    outcode1 = CompOutCode (x1, y1, xmin, xmax, ymin, ymax);
    do {
        if (!(outcode0 | outcode1)) {
            accept = TRUE; done = TRUE;
        } else if (outcode0 & outcode1)
            done = TRUE;
        else {
            double x, y;

            outcodeOut = outcode0 ? outcode0 : outcode1;
            if (outcodeOut & TOP) {
                x = x0 + (x1 - x0) * (ymax - y0) / (y1 - y0);
                y = ymax;
            } else if (outcodeOut & BOTTOM) {
                x = x0 + (x1 - x0) * (ymin - y0) / (y1 - y0);
                y = ymin;
            } else if (outcodeOut & RIGHT) {
                y = y0 + (y1 - y0) * (xmax - x0) / (x1 - x0);
                x = xmax;
            } else {
                y = y0 + (y1 - y0) * (xmin - x0) / (x1 - x0);
                x = xmin;
            }
            if (outcodeOut == outcode0) {
                x0 = x; y0 = y;
                outcode0 = CompOutCode (x0, y0, xmin, xmax, ymin, ymax);
            } else {
                x1 = x; y1 = y;
                outcode1 = CompOutCode (x1, y1, xmin, xmax, ymin, ymax);
            }
        }
    } while (done == FALSE);

    if (accept)
        MidpointLineReal (x0, y0, x1, y1, value);
}



outcode CompOutCode (
    double x, double y, double xmin, double xmax, double ymin, double ymax)
{
    outcode code = 0;
    if (y > ymax)
        code |= TOP;
    else if (y < ymin)
        code |= BOTTOM;
    if (x > xmax)
        code |= RIGHT;
    else if (x < xmin)
        code |= LEFT;
    return code;
}



void MidpointLineReal (double x0, double y0, double x1, double y1, double value)
{
    double dx = x1 - x0;
    double dy = y1 - y0;
    double d = 2*dy - dx;
    double incrE = 2*dy;
    double incrNE = 2*(dy - dx);
    double x = x0;
    double y = y0;
    WritePixel (x, y, value);

    while (x < x1) {
        if (d <= 0) {
            d += incrE;
            x++;
        } else {
            d += incrNE;
            x++;
            y++;
        }
        WritePixel (x, y, value);
    }
}

Mid-point Ellipse Scan Conversion

void MidpointEllipse (int a, int b, int value)
/* Assumes center of ellipse is at the origin. Note that overflow may occur */
/* for 16-bit integers because of the squares */
{
    double d2;
    int x = 0;
    int y = b;
    double d1 = b*b - (a*a*b) + (0.25*a*a);

    EllipsePoints (x, y, value);    /* The 4-way symmetrical WritePixel */

    while (a*a*(y - 0.5) > b*b*(x + 1)) {
        if (d1 < 0)
            d1 += b*b*(2*x + 3);
        else {
            d1 += b*b*(2*x + 3) + a*a*(-2*y + 2);
            y--;
        }
        x++;
        EllipsePoints (x, y, value);
    }

    d2 = b*b*(x + 0.5)*(x + 0.5) + a*a*(y - 1)*(y - 1) - a*a*b*b;
    while (y > 0) {
        if (d2 < 0) {
            d2 += b*b*(2*x + 2) + a*a*(-2*y + 3);
            x++;
        } else
            d2 += a*a*(-2*y + 3);
        y--;
        EllipsePoints (x, y, value);
    }
}

The bitBlock Transfer Algorithm

typedef struct {
    point topLeft, bottomRight;
} rectangle;

typedef struct {
    char *base;
    int width;
    rectangle rect;
} bitmap;

typedef struct {
    unsigned int bits:32;
} texture;

typedef struct {
    char *wordptr;
    int bit;
} bitPointer;

void bitBlt (
    bitmap map1,
    point point1,
    texture tex,
    bitmap map2,
    rectangle rect2,
    writeMode mode)
{
    int width;
    int height;
    bitPointer p1, p2;

    clip x-values;
    clip y-values;

    width = rect2.bottomRight.x - rect2.topLeft.x;
    height = rect2.bottomRight.y - rect2.topLeft.y;

    if (width < 0 || height < 0)
        return;

    p1.wordptr = map1.base;
    p1.bit = map1.rect.topLeft.x % 32;
    /* And the first bit in the bitmap is a few bits further in */
    /* Increment p1 until it points to the specified point in the first bitmap */
    IncrementPointer (p1, point1.x - map1.rect.topLeft.x + map1.width *
        (point1.y - map1.rect.topLeft.y));

    /* Same for p2 - it points to the origin of the destination rectangle */
    p2.wordptr = map2.base;
    p2.bit = map2.rect.topLeft.x % 32;
    IncrementPointer (p2, rect2.topLeft.x - map2.rect.topLeft.x +
        map2.width * (rect2.topLeft.y - map2.rect.topLeft.y));

    if (p1 < p2) {
        /* The pointer p1 comes before p2 in memory; if they are in the same bitmap, */
        /* the origin of the source rectangle is either above the origin for the */
        /* destination or, if at the same level, to the left of it */
        IncrementPointer (p1, height * map1.width + width);
        /* Now p1 points to the lower right word of the rectangle */
        IncrementPointer (p2, height * map2.width + width);
        /* Same for p2, but the destination rectangle */
        point1.x += width;
        point1.y += height;
        /* This point is now just beyond the lower right in the rectangle */
        while (height-- > 0) {
            /* Copy rows from the source to the target bottom to top, right to left */
            DecrementPointer (p1, map1.width);
            DecrementPointer (p2, map2.width);
            temp_y = point1.y % 32;    /* used to index into texture */
            temp_x = point1.x % 32;
            /* Now do the real bitBlt from bottom right to top left */
            RowBltNegative (p1, p2, width, BitRotate (tex[temp_y], temp_x), mode);
        } /* while */
    } else { /* if p1 >= p2 */
        while (height-- > 0) {
            /* Copy rows from source to destination, top to bottom, left to right */
            /* Do the real bitBlt, from top left to bottom right */
            RowBltPositive (same arguments as before);
            increment pointers;
        } /* while */
    } /* else */
} /* bitBlt */

void ClipValues (bitmap *map1, bitmap *map2, point *point1, rectangle *rect2)
{
    if (*point1 not inside *map1) {
        adjust *point1 to be inside *map1;
        adjust origin of *rect2 by the same amount;
    }

    if (origin of *rect2 not inside *map2) {
        adjust origin of *rect2 to be inside *map2;
        adjust *point1 by the same amount;
    }

    if (opposite corner of *rect2 not inside *map2)
        adjust opposite corner of *rect2 to be inside;
    if (opposite corner of corresponding rectangle in *map1 not inside *map1)
        adjust opposite corner of rectangle;
} /* ClipValues */






void RowBltPositive (
    bitPtr p1, bitPtr p2,    /* Source and destination pointers */
    int n,                   /* How many bits to copy */
    char tword,              /* Texture word */
    writeMode mode)          /* Mode to blt pixels */
{
    /* Copy n bits from position p1 to position p2 according to the mode */
    while (n-- > 0) {
        if (BitIsSet (tword, 32))      /* If texture says it is OK to copy... */
            MoveBit (p1, p2, mode);    /* then copy the bit */
        IncrementPointer (p1);
        IncrementPointer (p2);
        RotateLeft (tword);            /* Rotate bits in tword to the left */
    } /* while */
} /* RowBltPositive */



Phong Shading

#include <stdio.h>
#include <math.h>
#include <graphics.h>    /* Borland BGI: initgraph, putpixel, closegraph */
#include <dos.h>         /* Borland: delay */

double db1=2.5,db2=65535.,pi;
int colors[]={3,6,10,13,6,3,10,13,6,3,13,10},
    d[]={640,350,1},i,k,
    palette[]={000,010,001,011,020,002,022,077,
               040,004,044,060,006,066,007,077},
    x,y,x_min,x_max,y_min,y_max;
int min,sec;
unsigned short random;

main()
{
    double a,b,c,l0,l1,l2,ln,ln1,n0,n1,n2,p,q,r=128,s,t,v[12][3];
    int n;
    int graphdriver = DETECT, graphmode;
    int color;

    initgraph(&graphdriver,&graphmode,"");
    /* for (n=0;n<16;n++) */
#ifdef Intel
    printf("\n\t\t 80387 Phong Shading Demonstration Program\n");
#else
    printf("\n\t\t\t Phong Shading Demonstration\n");
#endif
    /* printf("\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n");
    start=clock(); */

    /* Pixel aspect ratio. Original value is 1.3 which works with EGA */
    /* This is hence the version for my - ThL - EI20 VGA Card */
    a=1.0;
    /* Screen center coordinates */
    b=0.5*(d[0]-1);    /* x-position */
    c=0.5*(d[1]-1);    /* y-position */
    /* Unit length light source vector */
    l0=-1/sqrt(3.);
    l1=l0;
    l2=-l0;
    /* Ratio circumference to diameter of a circle */
    pi=4*atan(1.);
    /* A dozen vertices evenly spread over a unit sphere */
    v[0][0]=0;
    v[0][1]=0;
    v[0][2]=1;
    s=sqrt(5.);
    for (i=1;i<11;i++) {
        p=pi*i/5;
        v[i][0]=2*cos(p)/s;
        v[i][1]=2*sin(p)/s;
        v[i][2]=(1.-i%2*2)/s;
    }
    v[11][0]=0;
    v[11][1]=0;
    v[11][2]=-1;

    /* Loop to Phong shade each pixel */
    y_max=c+r;
    y_min=2*c-y_max;
    for (y=y_min;y<=y_max;y++) {
        s=y-c;
        n1=s/r;
        ln1=l1*n1;
        s=r*r-s*s;
        x_max=b+a*sqrt(s);
        x_min=2*b-x_max;
        for (x=x_min;x<=x_max;x++) {
            t=(x-b)/a;
            n0=t/r;
            t=sqrt(s-t*t);
            n2=t/r;
            /* Compute dot product and clamp to positive value */
            ln=l0*n0+ln1+l2*n2;
            if (ln<0) ln=0;
            /* cos(e.r)**27 */
            t=ln*n2;
            t+=t-l2;
            t*=t*t;
            t*=t*t;
            t*=t*t;
            /* Nearest vertex to normal yields max dot product */
            /* Get its color */
            for (i=0,p=0;i<11;i++)
                if (p<(q=n0*v[i][0]+n1*v[i][1]+n2*v[i][2])) {
                    p=q;
                    k=colors[i];
                }/*end for*/

            /* Aggregate ambient, diffuse, and specular intensities; do dither */
            random=37*random+1;
            i=k-db1+db1*ln+t+random/db2;
            /* Clamp values outside range of three color level to black or white */
            if (i < (k-2)) i=0;
            else
                if (i > k) i=15;
            putpixel(x,y,i);
        }/*end for*/
    }/*end for*/

exit:
    delay(5000);
    closegraph();
}/*end main*/

14 Appendix G 

Algorithm:

Task_schedule (G(V,E), CTRL_VARS[N], PE = {PE1, PE2, ..., PEM})

For each combination of CTRL_VARS do
{
    Generate a DFG Gsub(V, E, CTRL_VARS[I]) which is a sub-graph of G(V,E). Only the nodes
    and edges in the control flow corresponding to the current combination of CTRL_VARS are
    included in this sub-graph.

    Generate the PCP schedule of Gsub. Let the schedule be PCP_sched[I] and the delay be
    PCP_delay[I].
}

Sort PCP_sched and PCP_delay and Gsub in decreasing order of PCP_delay[I].

Generate the Branch and bound schedule for Gsub[0], the sub-graph with the worst PCP_delay.
Let the schedule be BB_sched[I=0] and the delay be BB_delay[I=0].
Initialize worst_bb_delay = BB_delay[0]

For all the other sub-graphs do
{
    if (PCP_delay[I] < worst_bb_delay) then
        BB_sched[I] = PCP_sched[I];
        BB_delay[I] = PCP_delay[I];
    else
        Generate BB_sched[I] and BB_delay[I];
        If (BB_delay[I] > worst_bb_delay) then
            worst_bb_delay = BB_delay[I];
}

Generate the branching tree with the help of G(V,E). In this branching tree, the edge
represents the choices (K and K') and the node represents the variable (K).
Initialize the current path to the one leading from the top to the leaf in such a way that the DFG
corresponding to this path gives the worst_bb_delay. The path is nothing but a list of edges
tracing from the top node till the leaf.
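The selection step in the second loop of the algorithm (reuse the PCP schedule when its delay is already below the current worst branch-and-bound delay, otherwise run branch and bound) can be sketched in C as follows. This is a sketch under assumptions, not the scheduler itself: only the delays are tracked, the callback type bb_delay_fn stands in for the real branch-and-bound scheduler, and demo_bb with its table of delays is a purely hypothetical stub used to exercise the logic.

```c
#include <stddef.h>

/* Hypothetical callback: runs branch and bound on sub-graph i and
   returns the resulting delay. */
typedef int (*bb_delay_fn)(size_t i);

/*
 * pcp_delay[] must already be sorted in decreasing order, so index 0 is
 * the sub-graph with the worst PCP delay. Fills bb_delay[] for all n
 * sub-graphs and returns worst_bb_delay.
 */
int select_schedules(const int *pcp_delay, int *bb_delay, size_t n,
                     bb_delay_fn run_bb)
{
    if (n == 0)
        return 0;

    bb_delay[0] = run_bb(0);              /* worst PCP case always gets BB */
    int worst_bb_delay = bb_delay[0];

    for (size_t i = 1; i < n; i++) {
        if (pcp_delay[i] < worst_bb_delay) {
            bb_delay[i] = pcp_delay[i];   /* PCP schedule is good enough */
        } else {
            bb_delay[i] = run_bb(i);      /* refine with branch and bound */
            if (bb_delay[i] > worst_bb_delay)
                worst_bb_delay = bb_delay[i];
        }
    }
    return worst_bb_delay;
}

/* Stub scheduler for illustration only: looks the delay up in a table. */
static const int demo_bb_table[] = {15, 17, 8};

int demo_bb(size_t i)
{
    return demo_bb_table[i];
}
```

The point of the comparison against worst_bb_delay is that the more expensive branch-and-bound search is only run for sub-graphs whose PCP delay could still affect the overall worst case.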